Introduction




C. Robert Morgan 

Senior Consulting Engineer and 
Technical Program Manager, 
Core Technology Group



The complexity of high-performance 
systems and the need for ever-increased 
performance to be gained from those 
systems create a challenge for engi-
neers, one that requires both experience
and innovation in the development 
of software tools. The papers in this 
issue of the Journal are a few selected 
examples of the work performed 
within Compaq and by researchers 
worldwide to advance the state of the 
art. In fact, Compaq supports rele- 
vant research in programming lan- 
guages and tools. 

Compaq has been developing 
high-performance tools for more 
than thirty years, starting with the 
Fortran compiler for the DIGITAL 
PDP-10, introduced in 1967. Later 
compilers and tools for VAX com- 
puter systems, introduced in 1977, 
made the VAX system one of the most 
usable in history. The compilers and 
debugger for VAX/VMS are exem-
plary. With the introduction of the
VAX successor in 1992, the 64-bit 
RISC Alpha systems, Compaq has 
continued the tradition of developing 
advanced tools that accelerate appli- 
cation performance and usability for 
system users. The papers, however, 
represent not only the work of 
Compaq engineers but also that of 
researchers and academics who are 
working on problems and advanced 
techniques of interest to Compaq.

The paper on characterization of 
system workloads by Casmira, Hunter, 
and Kaeli addresses the capture of 
basic data needed for the development 
of tools and high-performance appli- 
cations. The authors' work focuses 
on generating accurate profile and 
trace data on machines running the 
Windows NT operating system. 



Profiling describes the point in the 
program that is most frequently 
executed. Tracing describes the 
commonly executed sequence of 
instructions. In addition to helping 
developers build more efficient 
applications, this information assists 
designers and implementers of future 
Windows NT systems. 

Every compiler consists of two 
components: the front end, which 
analyzes the specific language, and 
the back end, which generates opti- 
mized instructions for the target 
machine. An efficient compiler is a 
balance of both components. As lan- 
guages such as C++ evolve, the com- 
piler front end must also evolve to 
keep pace. C++ has now been stan- 
dardized, so evolutionary changes 
will lessen. However, compiler devel- 
opers must continue to improve 
front-end techniques for implement- 
ing the language to ensure ever better 
application performance. An impor- 
tant feature of C++ compiler develop- 
ment is C++ templates. Templates 
may be implemented in multiple 
ways, with varying effects on appli- 
cation programs. The paper by 
Itzkowitz and Foltan describes
Compaq's efficient implementation 
of templates. On a related subject, 
Rotithor, Harris, and Davis describe
a systematic approach Compaq has 
developed for monitoring and 
improving C++ compiler perfor-
mance to minimize cost and maxi-
mize function and reliability.

Improved optimization techniques 
for compiler back ends are presented 
in three papers. In the first of these,
Reinig addresses the requirement in
an optimizing compiler for an accu-
rate description of the variables and
fields that may be changed by an
assignment operation, and describes 
an efficient technique used in the 
C/C++ compilers for gathering this 
information. Sweany, Carr, and Huber 
describe techniques for increasing 
execution speed in processors like 
the Alpha that issue multiple instruc- 
tions simultaneously. The technique 
reorders the instructions in the pro- 
gram to increase the number of 
instructions that are simultaneously 
issued. Maximizing the performance 
of multiprocessor systems is the sub- 
ject of the paper by Hall et al., which 
was previously published in IEEE 
Computer and updated with an 
addendum for this issue. The authors 
describe the SUIF compiler, which 
represents some of the best research 
in this area and has become the basis 
of one part of the ARPA compiler 
infrastructure project. Compaq 
assisted researchers by providing the 
DIGITAL Fortran compiler front end 
and an AlphaServer 8400 system. 

As compilers become more effec- 
tive in increasing application program 
performance, debugging
the programs becomes more difficult. 
The difficulty arises because the 
compiler gains efficiency by reorder- 
ing and eliminating instructions. 
Consequently, the instructions for 
an application program are not easily 
identifiable as part of any particular 
statement. The debugger cannot 
always report to the application pro-
grammer where variables are stored or
what statement is currently being 
executed. Application programmers 
have two choices: Debug an unopti- 
mized version of the program or find 
some other technique for determining 
the state of the program. The paper 



by Brender, Nelson, and Arsenault 
reports an advanced development 
project at Compaq to provide tech- 
niques for the debugger to discover 
a more accurate image of the state of 
the program. These techniques are 
currently being added to Compaq 
debuggers. 

One of the problems that tool 
developers face is increasing tool reli- 
ability. Tool developers, therefore, 
test the code. However, developers 
are often biased; they know how their
programs operate, and they test cer-
tain aspects of the code but not oth-
ers. The paper by McKeeman describes
a technique called differential testing 
that generates correct random tests of 
tools such as compilers. The random 
nature of the tests removes the devel- 
opers' bias. The tool can be used for 
two purposes: to improve existing 
tools and to compare the reliability 
of competitive tools. 

The High Performance Technical 
Computing Group and the Core 
Technology Group within Compaq 
are pleased to help develop this issue 
of the Journal. Studying the work 
performed within Compaq and by 
other researchers worldwide is one 
way that we remain at the cutting 
edge of technology of programming
language, compiler, and program- 
ming tool research. 
Foreword




William C. Blake 

Director, High Performance 
Technical Computing and 
Core Technology Groups



You might think that the cover of this 
issue of the Digital Technical Journal 
is a bit odd. After all, what could be 
the relevance of those ancient alchemists
in the drawing to the computer-age 
topic of programming languages and 
tools? Certainly, both alchemists and 
programmers work busily on new 
tools. An even more interesting 
metaphorical connection is the 
alchemist and the compiler software 
developer as creators of tools that 
transform (transmute, in the strict 
sense of alchemy) the base into the 
precious. The metaphor does, how-
ever, break down. Unlike the myth
and folklore of alchemy, the science 
and technology of compiler software 
development is a real and important 
part of processing a new solution or 
algorithm into the correct and high- 
est performance set of actual machine 
instructions. This issue of the Journal 
addresses current, state-of-the-art 
work at Compaq Computer Corp- 
oration on programming languages 
and tools. 

Gone are the days when program- 
mers plied their craft "close to the 
machine," that is, working in detailed 
machine instructions. Today, system
designers and application developers, 
driven by the pressures of time to 
market and technical complexity, 
must express their solutions in terms 
"close to the programmer" because 
people think best in ways that are 
abstract, language dependent, and 
machine independent. Enhancing 
the characteristics of an abstract
high-level language, however, con- 
flicts with the need for lower level 
optimizations that make the code 
run fastest. Computers still require 
detailed machine instructions, and 



the high-level programs close to the 
programmer must be correctly com- 
piled into those instructions. This 
semantic gap between programming 
languages and machine instructions is 
central to the evolution of compilers 
and to microprocessor architectures 
as well. The compiler developer's role 
is to help close the gap by preserving 
the correctness of the compilation 
and at the same time resolving the 
trade-offs between the optimizations 
needed for improvements "close to 
the programmer" and those needed 
"close to the machine." 

To put the work described in this 
Journal into context, it is helpful to 
think about the changes in compiler 
requirements over the past 15 years. 
It was in the early 1980s that the direc- 
tion of future computer architectures 
changed from increasingly complex 
instruction sets, CISC, that supported 
high-level languages to computer 
architectures with much simpler, 
reduced instruction sets, RISC. Three 
key research efforts led the way: the 
Berkeley RISC processor, the IBM 
801 RISC processor, and the Stanford 
MIPS processor. All three approaches 
dramatically reduced the instruction 
set and increased the clock rate. The 
RISC approach promised improve- 
ments up to a factor of five compared 
with CISC machines using the same 
manufacturing technology. Compaq's 
transition from the VAX to the Alpha 
64-bit RISC architecture was a direct 
result of the new architectural trend. 

As a consequence of these major 
architectural changes, compilers and 
their associated tools became signifi- 
cantly more important. New, much 
more complex compilers for RISC 
machines eliminated the need for the 
large, microcoded CISC machines.
The complexities of high-level lan- 
guage processing moved from the 
petrified software of CISC micro- 
processors to a whole new generation 
of optimizing compilers. This move 
caused some to claim that RISC really 
stands for "Relegate Important Stuff 
to Compilers." 

The introduction of the third-gen- 
eration Alpha microprocessor, the 
21264, demonstrates that the shift to 
RISC and Alpha system implementa- 
tions and compilers served Compaq 
customers well by producing reliable, 
accurate, and high-performance com- 
puters. In fact, Alpha systems, which 
have the ability to process over a bil- 
lion 64-bit floating-point numbers 
per second, perform at levels formerly 
attained only by specialized super- 
computers. It is not surprising that 
the Alpha microprocessor is the most 
frequently used microprocessor in the 
top 500 largest supercomputing sites 
in the world. 

After reading through the papers 
in this issue, you may wonder what is 
next for compilers and tools. As phys- 
ical limits curtail the shrinking of sili- 
con feature sizes, there is not likely to 
be a repeat of the performance gains 
at the microprocessor level, so atten- 
tion will turn to compiler technology 
and computer architecture to deliver 
the next thousandfold increase in sus- 
tained application performance. The 
two principal laws that affect dramatic 
application performance improve- 
ments are Moore's Law and Amdahl's 
Law. Moore's Law states that perfor- 
mance will double each 18 months 
due to semiconductor process scaling; 
and Amdahl's Law expresses the 
diminishing returns of various system 



speedup enhancements. In the next 
15 years, Moore's Law may be stopped
by the physical realities of scaling lim- 
its. But Amdahl's Law will be broken 
as well, as improvements in parallel 
language, tool development, and new 
methods of achieving parallelism will 
positively affect the future of compil- 
ers and hence application performance. 
As you will see in papers in this issue, 
there is a new emphasis on increasing 
execution speed by exploiting the 
multiple instruction issue capability of 
Alpha microprocessors. Improvements 
in execution speed will accelerate dra- 
matically as future compilers exploit 
performance improvement techniques 
using new capabilities evolved in Alpha. 
Compilers will deliver new ways of 
hiding instruction latency (reducing 
the performance gap between vector 
processors and RISC superscalar 
machines), improved unrolling and 
optimization of loops, instruction 
reordering and scheduling, and ways 
of dealing with parallel decomposi- 
tion and data layout in nonuniform 
memory architectures. The challenges 
to compiler and tool developers will 
undoubtedly increase over time. 

By not relying on hardware 
improvements to deliver all the 
increases in performance, compiler 
wizards are making their own contri- 
butions — always watchful of correct- 
ness first, then run-time performance, 
and, finally, speed and efficiency of the 
software development process itself. 
Tracing and
Characterization of 
Windows NT-based 
System Workloads 



Jason P. Casmira 
David P. Hunter 
David R. Kaeli



To optimize the design of pipelines, branch pre- 
dictors, and cache memories, computer archi- 
tects study the characteristics of benchmark 
programs by examining traces, i.e., samples of 
program execution. Since commercial desktop 
applications are increasingly dependent on ser- 
vices and application programming interfaces 
provided by the host operating system, the 
authors argue that traces from benchmark exe-
cution must capture operating system execution
in addition to native application execution. 
Common benchmark-based workloads, how- 
ever, lack operating system execution. This 
paper discusses the ongoing joint efforts of the 
Northeastern University Computer Architecture 
Research Laboratory and Compaq Computer 
Corporation's Advanced and Emerging Tech- 
nologies Advanced Development Group to cap- 
ture operating system-rich traces on Alpha- 
based machines running the Windows NT oper- 
ating system. The authors describe the latest 
PatchWrx software toolset and demonstrate its 
trace-generating capabilities by characterizing 
numerous applications. Included is a discussion 
of the fundamental differences between using 
traces captured from common benchmark pro- 
grams and using those captured on commercial 
desktop applications. The data presented 
demonstrates that operating system execution 
can dominate the overall execution time of 
desktop applications such as Microsoft Word, 
Microsoft Visual C/C++, and Microsoft Internet 
Explorer and that the characteristics of the 
operating system instruction stream can be 
quite different from those typically found in 
benchmarking workloads. 



The computer architecture research community com- 
monly uses trace-driven simulation in pursuing 
answers to a variety of design issues. Architects spend a 
significant amount of time studying the characteristics 
of benchmark programs by examining traces, i.e., sam- 
ples taken from program execution. Popular bench- 
mark programs include the SPEC and the BYTEmark 2 
benchmark test suites. Since the underlying assump- 
tion is that these programs generate workloads that 
represent user applications, today's computer designs 
have been optimized based on the characteristics of 
these benchmark programs. 

Although the authors of popular benchmarks are 
well intentioned, the resulting workloads lack operat- 
ing system execution and consequently do not repre- 
sent some of the most prevalent desktop applications, 
e.g., Microsoft Word, Microsoft Visual C/C++, and 
Microsoft Internet Explorer. Such applications make 
heavy use of application programming interfaces 
(APIs), which in turn execute many instructions in the 
operating system. As a result, the overall performance 
of many desktop applications depends on efficient 
operating system interaction. Clearly operating system 
overhead can greatly reduce the benefits of a new 
computer design feature. Past architectural studies, 
however, have generally ignored operating system 
interaction because few tools can generate operating 
system-rich traces. 

This paper discusses the ongoing joint efforts of 
Northeastern University and Compaq Computer 
Corporation to capture operating system-rich traces on 
DIGITAL Alpha-based machines running the Microsoft 
Windows NT operating system. We argue that for traces 
of today's workloads to be accurate, they must capture 
the operating system execution as well as the native appli- 
cation execution. This need to capture complete pro- 
gram trace information has been a driving force behind 
the development and use of software tools such as the 
PatchWrx dynamic execution-tracing toolset, which we 
describe in this paper. 

The PatchWrx toolset was originally developed by 
Sites and Perl at Digital Equipment Corporation's 
Systems Research Center. They described PatchWrx, as 
developed for Windows NT version 3.5, in "Studies of 
Windows NT Performance Using Dynamic Execution
Traces."' The Northeastern University Computer 
Architecture Research Laboratory and Compaq's 
Advanced and Emerging Technologies Advanced 
Development Group continue to develop the toolset. 
We have updated the framework to operate under 
Windows NT version 4.0, added the ability to trace 
programs that have code sections larger than 4 mega- 
bytes (MB), added multiple trace buffer sizes, and 
developed additional postprocessing tools. 

After briefly discussing related tracing tools, we 
describe the PatchWrx toolset and specify the new 
features we have added. We then analyze PatchWrx 
traces captured on Windows NT version 4.0, demon- 
strating the capabilities of the tool while illustrating 
the importance of capturing operating system-rich 
traces. In the final section, we summarize the paper, 
discuss the current limitations of the toolset, and sug- 
gest new directions for development and study. 

Trace Generation Tools 

Trace-driven simulation has been the method of 
choice for evaluating the merits of various architec- 
tural trade-offs." Traces captured from the system 
under test are recorded and replayed through a model 
of the proposed design. Computer architecture 
researchers have proposed methodologies that capture 
both application and operating system references. 
These tools include hardware-based 6 10 and software-
based" methods. Some of the issues involved in cap-
turing operating system-rich traces are

1. Tracing overhead (system slowdown) 

2. Accuracy (perturbation of the memory address space)

3. Completeness (capturing all desired information, 
e.g., the operating system reference stream) 
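
The record-and-replay approach described at the start of this section can be illustrated with a small sketch. It replays a recorded address trace through a model of a direct-mapped cache and counts hits and misses. The function name, the cache geometry, and the trace format (a plain list of byte addresses) are assumptions chosen for illustration; they are not taken from any of the tools discussed here.

```python
def simulate_direct_mapped(trace, num_lines=256, line_size=32):
    """Replay a list of byte addresses through a direct-mapped cache model.

    Returns (hits, misses). num_lines and line_size are illustrative
    defaults, not parameters of any real machine discussed in the text.
    """
    tags = [None] * num_lines          # one tag per cache line, empty at start
    hits = 0
    for address in trace:
        block = address // line_size   # memory block that holds this address
        index = block % num_lines      # cache line the block maps to
        tag = block // num_lines       # identifies which block occupies the line
        if tags[index] == tag:
            hits += 1
        else:
            tags[index] = tag          # miss: fill the line
    return hits, len(trace) - hits


if __name__ == "__main__":
    # Two references to the same line hit; a conflicting block evicts it.
    print(simulate_direct_mapped([0, 4, 32, 0], num_lines=2, line_size=32))
    print(simulate_direct_mapped([0, 64, 0], num_lines=2, line_size=32))
```

A trace that omits operating system references would simply never present those addresses to the model, which is why completeness appears in the list above.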

Table 1 contains a list of 10 tracing tools that have 
been developed over the past 10 to 15 years. Although 



far from complete, this list provides a sample of the 
tools that have been used to generate input to a variety 
of trace-driven simulation studies. We have character-
ized each tool in terms of the three issues (criteria) pre-
viously mentioned. Table 1 lists the target platform(s)
for each tracing tool. 

Note that many of these tools cannot capture oper- 
ating system activity. For those that can, their associ- 
ated slowdown can significantly affect the accuracy of 
the captured trace. Of the tools that provide this capa- 
bility, PatchWrx introduces the least amount of slow- 
down yet maintains the integrity of the address space. 
The next section discusses the PatchWrx toolset. 

PatchWrx 

PatchWrx is a dynamic execution-tracing toolset 
developed for use on the Alpha-based Microsoft 
Windows NT operating system. The toolset utilizes 
the Privileged Architecture Library (PAL) facility, also 
referred to as PALcode, of the Alpha microprocessor 
to perform tracing with minimal overhead. 2 ' PatchWrx 
can instrument, i.e., patch, all Windows NT applica- 
tion and system binary images, including the kernel, 
operating system services, drivers, and shared libraries. 
The PAL facility is a set of architected functions and 
instructions that provides a consistent interface to a set 
of complex system functions. These routines provide 
primitives for memory management, context switch- 
ing, interrupts, and exceptions. 

PatchWrx and the Alpha PAL Routines 

The PatchWrx software tool is made possible through 
the PAL used by DIGITAL Alpha microprocessors. 
PAL routines have access to physical memory and 
internal hardware registers and operate with interrupts 
disabled. PALcode is loaded from disk at system boot 
time. We modified and extended the shrink-wrapped 
Alpha PALcode on a DIGITAL Alpha 21064-based 
system to support the PatchWrx operations. The mod-
ified PatchWrx PAL routines serve two major pur-
poses: (1) to reserve the trace buffer at system boot
time and (2) to log trace entries at trace time.

Table 1
Sample of Tracing Tools

Name       Average Slowdown   Address Perturbation   Operating System Activity   Platform
ATOM       10X to 100X        No                     Yes                         DIGITAL Alpha UNIX
ATUM       20X                No                     Yes                         DIGITAL VAX OpenVMS
EEL        10X to 100X        Yes                    No                          SPARC Solaris
Etch       35X                Yes                    No                          Intel x86 Microsoft Windows NT V4.0
NT-Atom    10X to 100X        No                     No                          DIGITAL Alpha Microsoft Windows NT V4.0
PatchWrx   4X                 No                     Yes                         DIGITAL Alpha Microsoft Windows NT V4.0
Pixie      10X to 100X        Yes                    No                          DIGITAL MIPS ULTRIX
QPT        10X to 100X        Yes                    No                          SPARC Solaris, DIGITAL ULTRIX
Shade      6X                 No                     No                          SPARC Solaris
SimOS      10X to 50,000X     No                     Yes                         DIGITAL Alpha UNIX, SGI IRIX, SPARC Solaris

One way that PatchWrx maintains a low operating
overhead is to store the captured trace in a physical 
memory buffer, which is reserved at boot time. The 
size of the buffer can be varied depending on the 
amount of physical memory installed on the system. 
Since we use PAL routines to reserve this memory, the 
operating system is not aware that the memory exists 
because the PALcode performs all low-level system ini- 
tialization before the operating system is started. 

PatchWrx logs all trace entries in this buffer. Writing
trace entries directly to physical memory has several
advantages. First, writing to memory is much faster
than writing to disk or to tape. Second, using physical
memory allows tracing of the lowest levels of the oper-
ating system (i.e., the page fault handler) without gen-
erating page faults. Third, using physical memory
allows tracing across multiple threads running in mul- 
tiple address spaces regardless of which address space is 
currently running. 

To enable PatchWrx to operate under Windows NT
versions 3.51 and 4.0, we started with the PAL rou-
tines modified by Sites and Perl and made additional
modifications as required by the operating system ver- 
sions. These modifications were concentrated in the 
process data structures. The PatchWrx-specific PAL 
routines are listed in Table 2. The first three routines 
are used for reading the trace entries from the buffer 
and for turning tracing on and off. The remaining five
routines are used to log trace entries based on the type 
of instruction instrumented. 

PatchWrx Image Instrumentation

Next we describe how we use PatchWrx to instrument
Microsoft Windows NT images. Patching the operat- 
ing system involves the instrumentation of all the 
binary images, including applications, operating sys- 
tem executables, libraries, and kernel. Once patching 
is complete, trace entries are logged by means of PAL 
routines as images execute. 



We define a patched instruction as an instruction 
within an image's code section that is overwritten with 
an unconditional branch (BR) to a patch. The target of 
the BR contains the patch section. The patch section 
includes the trap (CALL_PAL) to the appropriate PAL 
routine that logs a trace entry corresponding to the 
type of instruction patched and the return branch to 
the original target. 

PatchWrx does not modify the original binary
images; instead, it generates new images that contain 
patches. This operation preserves the original images 
on the system in case they need to be restored. 
Instrumentation involves replacing all branching 
instructions of type unconditional branch, conditional 
branch (e.g., branch if equal to zero [BEQ]), branch 
to subroutine (BSR), function return (RET), jump 
(JMP), and jump to subroutine (JSR) within an 
image's code section with unconditional branches to 
a patch section. If loads and stores are also traced, 
PatchWrx replaces these instructions (e.g., load sign-
extended longword [LDL]) with unconditional
branches to the patch section, where the original load 
or store instruction is copied. A return branch is also 
needed to return control flow to the instruction subse-
quent to the original load. When PatchWrx encoun-
ters this patch, the tool records the register value of the
original load or store instruction in the trace log. The 
patch section contains all the patches for the image 
and is added to the rewritten image. Figure 1 shows 
examples of patched instructions. PatchWrx replaces
only branch instructions within an image to reduce the 
type and number of entries logged in the trace buffer. 
Using these traced branches, the tool can later recon- 
struct the basic blocks they represent. 
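
The rewriting step described above can be sketched on symbolic instructions. This is a minimal illustration, assuming instructions are plain strings and handling only the four instruction kinds shown in Figure 1 (JMP, JSR, BEQ, and optionally LDL); the names `PatchedImage` and `instrument` are hypothetical and do not come from the PatchWrx toolset. The sketch keeps the code section the same length and collects all patches in a separate patch section, as the text describes.

```python
from dataclasses import dataclass, field


@dataclass
class PatchedImage:
    code: list = field(default_factory=list)           # same length as the original code
    patch_section: list = field(default_factory=list)  # appended to the rewritten image


def instrument(code, trace_loads=False):
    """Rewrite symbolic Alpha-like instructions in the style of Figure 1."""
    image = PatchedImage()
    count = 0
    for insn in code:
        op, _, operands = insn.partition(" ")
        args = [a.strip() for a in operands.split(",")]
        patchable = op in ("JMP", "JSR", "BEQ") or (op == "LDL" and trace_loads)
        if not patchable:
            image.code.append(insn)  # untouched instruction
            continue
        count += 1
        patch, back = f"PATCH.{count:03d}", f"BACK.{count:03d}"
        if op == "JMP":
            image.code.append(f"BR {patch}")
            image.patch_section += [f"{patch}:", "CALL_PAL PWJSR", insn]
        elif op == "JSR":
            # BSR keeps the return address register field (args[0]).
            image.code.append(f"BSR {args[0]}, {patch}")
            image.patch_section += [f"{patch}:", "CALL_PAL PWJSR",
                                    f"JMP ZERO, {args[1]}"]
        elif op == "BEQ":
            image.code.append(f"BR {patch}")
            image.patch_section += [
                f"{patch}:", f"BEQ {args[0]}, {patch}T",
                "CALL_PAL PWBRF", f"BR {back}",
                f"{patch}T:", "CALL_PAL PWBRT", f"BR {args[1]}",
            ]
        else:  # LDL, only when loads are traced
            image.code.append(f"BR {patch}")
            image.patch_section += [f"{patch}:", "CALL_PAL PWLDST", insn,
                                    f"BR {back}"]
    return image


if __name__ == "__main__":
    result = instrument(["JMP ZERO, (R19)", "ADDL R1, R2, R3", "BEQ R2, TARGET.003"])
    print(result.code)
    print(result.patch_section)
```

Because every patched instruction is replaced one for one, the length of `code` never changes; only `patch_section` grows with the number of patches.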

As shown in Figure 1, PatchWrx replaces BR and
JMP instructions with BR instructions that transfer 
control to the patch section. The original BR or JMP 
instruction is repeated in the patch section for the pur- 
pose of recording the value of the target register (if 
necessary) into the trace buffer when the patched 
image is executed. This register value is necessary for
reconstructing the traced instruction stream. PatchWrx



Table 2
PatchWrx-specific PAL Routines

PAL Routine   Function
PWRDENT       Read a trace entry from trace memory
PWPEEK        Read an arbitrary location (for debug)
PWCTRL        Initialize, turn tracing on/off
PWBSR         Record a branch to subroutine
PWJSR         Record a jump/call/return
PWLDST        Record a load/store base register value
PWBRT         Record a conditional branch taken bit
PWBRF         Record a conditional branch fall-through bit
              ORIGINAL CODE          PATCHED CODE

EXAMPLE 1     JMP ZERO, (R19)                    BR PATCH.001

                                     PATCH.001:  CALL_PAL PWJSR
                                                 JMP ZERO, (R19)

EXAMPLE 2     JSR R26, (R19)                     BSR R26, PATCH.002

                                     PATCH.002:  CALL_PAL PWJSR
                                                 JMP ZERO, (R19)

EXAMPLE 3     BEQ R2, TARGET.003                 BR PATCH.003
                                     BACK.003:

                                     PATCH.003:  BEQ R2, PATCH.003T
                                                 CALL_PAL PWBRF
                                                 BR BACK.003
                                     PATCH.003T: CALL_PAL PWBRT
                                                 BR TARGET.003

EXAMPLE 4     LDL R20, 4(R16)                    BR PATCH.004
                                     BACK.004:

                                     PATCH.004:  CALL_PAL PWLDST
                                                 LDL R20, 4(R16)
                                                 BR BACK.004

Figure 1
Instruction Patch Examples



replaces JSR and BSR instructions with BSR patches. 
This replacement preserves the return address (RA) 
register field value, which contains the return address 
for the subroutine. Again, the original instruction is 
repeated in the patch section for register value record- 
ing during tracing to help facilitate reconstruction. 

Conditional branches have a larger and more com- 
plex patch than the other branch types because the 
original condition is duplicated and resolved within 
the patch. The taken or fall-through path generates a 
bit value when logged within the taken or fall-through
trace entry. The return branch in the patch section is a 
replica of the original conditional branch. 
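
The run-time behavior of a conditional branch patch can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and the encoding of the logged bit (1 for taken, 0 for fall-through) is an assumption, since the text states only that each path logs a bit value.

```python
def run_patched_beq(reg_value, trace):
    """Model the patch for 'BEQ Rn, TARGET' from Example 3 of Figure 1."""
    if reg_value == 0:        # BEQ Rn, PATCH.003T : the condition is re-evaluated in the patch
        trace.append(1)       # CALL_PAL PWBRT     : log the taken bit (assumed encoding)
        return "TARGET"       # BR TARGET.003
    trace.append(0)           # CALL_PAL PWBRF     : log the fall-through bit (assumed encoding)
    return "BACK"             # BR BACK.003


def replay(bits):
    """Recover the path taken at each execution of the branch from the logged bits."""
    return ["TARGET" if bit else "BACK" for bit in bits]


if __name__ == "__main__":
    trace = []
    paths = [run_patched_beq(value, trace) for value in (0, 5, 0)]
    print(trace, paths, replay(trace) == paths)
```

One bit per executed conditional branch is enough here because the two possible successors are known statically from the image.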

As explained earlier, for all patches, PatchWrx replaces 
the original branch with a patch unconditional branch. 
Since Alpha instructions are equal in size, this replace-
ment process allows patching without increasing the
code size within the image. Although the code size 
remains unchanged, the image size will increase in 
proportion to the number of patches added. This 



image size change becomes an issue for dynamically 
linked library (DLL) images. 

Patching Dynamic Link Libraries 

The Microsoft Windows NT operating system pro- 
vides a memory management system that allows shar- 
ing between processes. 23 For example, two processes 
that edit text files can share the text editor application 
image that has been mapped into memory. When the 
first process invokes the editor, the operating system 
loads the application into memory and maps the 
process's virtual address space to it. When the second 
process invokes the editor, rather than load another 
editor image, the operating system maps the second 
process's virtual address space to the physical pages 
that contain the editor. Of course, both processes con- 
tain local storage for private data. 

DLLs are loaded into memory and shared in this 
manner. When patches are added to a DLL, the size of 
the image increases. When this image is mapped to 
physical memory (as per its preferred base load
address), the larger image may overlap with another 
image having a base address within the new range. 
This image overlap can prevent the operating system 
from booting properly: some environment DLLs will 
conflict in memory because they perform calls directly 
into other DLLs at fixed offsets. To resolve this issue, 
we rebase- 4 the preferred base load addresses of the 
patched DLLs, which modifies the base load addresses 
of each patched DLL to eliminate conflicts. Rebasing 
affects the address accuracy of the patched system, 
though we are able to readjust the addresses during 
reconstruction. An increase in the paging activity may 
also be observed since the additional code may cross 
page boundaries. 
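
The address assignment behind rebasing can be sketched as below. This is a simplified illustration under stated assumptions: images are described only by a name, a preferred base, and a patched size; new bases are rounded up to a 64-KB boundary (an assumed alignment); and the function names are hypothetical. A real rebasing tool must also update the relocation information inside each image, which this sketch does not attempt.

```python
def rebase(images, alignment=0x10000):
    """Assign non-overlapping base addresses to patched images.

    images: iterable of (name, preferred_base, patched_size).
    Returns a dict mapping name to the new base address.
    """
    new_bases = {}
    next_free = 0
    for name, base, size in sorted(images, key=lambda image: image[1]):
        new_base = max(base, next_free)                   # never move below the preferred base
        new_base = -(-new_base // alignment) * alignment  # round up to the alignment
        new_bases[name] = new_base
        next_free = new_base + size                       # first address past this image
    return new_bases


def to_original_address(address, preferred_base, new_base):
    """Undo the rebasing shift when reconstructing addresses from a trace."""
    return address - (new_base - preferred_base)


if __name__ == "__main__":
    # a.dll grew past the preferred base of b.dll, so b.dll must move.
    images = [("a.dll", 0x10000000, 0x28000), ("b.dll", 0x10020000, 0x10000)]
    bases = rebase(images)
    print({name: hex(base) for name, base in bases.items()})
```

Keeping the old and new base of each image is what allows the addresses in a trace to be readjusted during reconstruction.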

The original version of the PatchWrx toolset was
developed on Microsoft Windows NT version 3.5.
When versions 3.51 and 4.0 were released, several mod-
ifications were made to the image format. In complet-
ing the 3.51- and 4.0-compatible versions of PatchWrx,
we had to address this issue. One change that affected 
how we patch was the placement of the Import Address 
Table (IAT) into the front of the initial code section of 
executable binary images. This table is used to look up 
the addresses of DLL procedures used (i.e., imported) 
by the executable binary. In developing the current gen-
eration of PatchWrx, we had to make modifications to
use image header fields that had previously remained 
unused or reserved, indicating the executable code sec- 
tions that contained data areas. 

Another issue that we addressed in the recent modi-
fications to PatchWrx was long branches. The original
version of PatchWrx replaces all branch, jump, call,
and return instructions with either BR or BSR instruc-
tions to the patch section. Since the PatchWrx tool has
no information about machine state during the patch- 
ing phase, it is impossible to utilize other branching 
instructions (e.g., JMP or JSR instructions) to provide 
this branch-to-patch transition. Register and register- 
indirect branching instructions would require per- 
turbing the machine state. Therefore, the developers 
could use only program counter (PC)-based offset 
branching instructions. 

As discussed previously, in replacing a control flow 
instruction with a patch branch, PatchWrx uses a BR
or BSR instruction in which the offset field is set to 
branch to the corresponding patch within the image's 
patch section. The Alpha architecture branching 
instructions use the format shown in Figure 2. 





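 31        26 25        21 20                                  0
+------------+------------+-------------------------------------+
|   OPCODE   |    REG     |         21-BIT DISPLACEMENT         |
+------------+------------+-------------------------------------+

Figure 2
Alpha Branch Instruction Format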
The branch target virtual address computation for
this format is newPC = (oldPC + 4) + (4 * sign- 
extended(21-bit branch displacement)). The register 
field holds the return address for BSRs. With this 
branch format and target virtual address computation, 
the Alpha architecture provides a branch target range 
of 4 MB from an instruction's current PC. 
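
The computation above can be sketched in a few lines of Python. This is an illustration only, not part of PatchWrx; the function names are hypothetical, and the opcode value used in the usage note is there only to build a complete instruction word.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def encode_branch(opcode: int, reg: int, displacement: int) -> int:
    """Build a branch word: opcode in bits 31-26, reg in 25-21, displacement in 20-0."""
    if not -(1 << 20) <= displacement < (1 << 20):
        raise ValueError("displacement does not fit in 21 bits")
    return ((opcode & 0x3F) << 26) | ((reg & 0x1F) << 21) | (displacement & 0x1FFFFF)


def branch_target(old_pc: int, instruction: int) -> int:
    """newPC = (oldPC + 4) + 4 * sign_extended(21-bit branch displacement)."""
    displacement = sign_extend(instruction & 0x1FFFFF, 21)
    return (old_pc + 4) + 4 * displacement
```

With the largest positive displacement (2^20 - 1), `branch_target(0, encode_branch(0x30, 31, (1 << 20) - 1))` evaluates to 4 * 1024 * 1024, which is the 4-MB forward limit discussed in the text.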

Several applications that run today on Microsoft
Windows NT version 4.0 are sufficiently large that the
displacement between a control flow instruction to be
patched and the patch location within the patch section
exceeds this 4-MB limit. (Recall that since we want to
avoid moving code or data sections, the patch section is
placed at the end of the image.) To address this problem,
we developed two new branch instructions for use with
PatchWrx. These new branches were not implemented
in the instruction set architecture of the Alpha
architecture. Instead, we used PALcode to implement them.
The two new branches are designated long branch (LBR) and
long branch subroutine (LBSR). Figure 3 illustrates the
format of these two instructions.

The computation of the target virtual address is
newPC = (oldPC + 4) + (4 * sign-extended(25-bit
branch displacement)) for LBR branches and newPC =
(oldPC + 4) + (32 * zero-extended(20-bit branch
displacement)) for LBSR branches. PatchWrx uses LBRs
when patching any control flow instruction that has
a displacement greater than 4 MB. PatchWrx uses
LBSRs similarly for control flow instructions that must
preserve the register field value.

When an LBR or LBSR instruction is executed 
within the image code section, a trap to PALcode 
occurs. Normally, CALL_PAL instructions have one of 
several defined function fields that cause a correspond- 
ing PAL routine to be executed. The two long branch 
instructions have function fields that do not belong to 
any of the defined CALL_PAL instructions and there- 
fore force an illegal instruction exception within the 
PALcode. This PALcode flow has been modified to 
detect if a long branch has been encountered. 





LBR INSTRUCTION FORMAT
Bits 31-26: OPCODE 000000
Bits 25-1:  25-BIT DISPLACEMENT
Bit 0:      0

LBSR INSTRUCTION FORMAT
Bits 31-26: OPCODE 000000
Bits 25-21: REG
Bits 20-1:  20-BIT DISPLACEMENT
Bit 0:      1

Figure 3

PALcode Long Branch Instruction Formats



As shown in Figure 3, both long branch types have
the same PALcode operation code (opcode) value of
000000. To distinguish between the two types, the least
significant bit in the instruction word is set to 0 for LBRs
and to 1 for LBSRs. This bit is not included as a usable
bit for the displacement fields of either branch type.
Consequently, each LBR has a 25-bit displacement field
and each LBSR has a 20-bit field. With a 25-bit usable
displacement field, the PALcode performs the LBR target
address computation, allowing a ±64-MB range.

Since each LBSR instruction has a 20-bit displacement
field, whereas the original Alpha architecture
branch displacement field is 21 bits, the target instruction
address computation for LBSR instructions is performed
differently than for standard branches within
the PALcode. As shown in the address computation
equation, the 20-bit displacement is multiplied by 32
rather than by 4 (as for the LBR branch). Notice that
the 20-bit displacement is always zero extended. The
computation provides the LBSR instruction with a
displacement of +32 MB.
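
To make the two long branch computations concrete, the following Python sketch decodes an instruction word laid out as in Figure 3 and applies the corresponding equation. It is only an illustration of the arithmetic: the function names are hypothetical, and the real dispatch happens inside the modified PALcode illegal-instruction flow, which is not modeled here.

```python
def sign_extend(value: int, bits: int) -> int:
    """Interpret the low `bits` bits of value as a two's-complement number."""
    value &= (1 << bits) - 1
    sign_bit = 1 << (bits - 1)
    return (value ^ sign_bit) - sign_bit


def encode_lbr(displacement: int) -> int:
    """Opcode 000000, 25-bit displacement in bits 25-1, bit 0 = 0."""
    if not -(1 << 24) <= displacement < (1 << 24):
        raise ValueError("displacement does not fit in 25 bits")
    return (displacement & 0x1FFFFFF) << 1


def encode_lbsr(reg: int, displacement: int) -> int:
    """Opcode 000000, reg in bits 25-21, 20-bit displacement in bits 20-1, bit 0 = 1."""
    if not 0 <= displacement < (1 << 20):
        raise ValueError("displacement does not fit in 20 unsigned bits")
    return ((reg & 0x1F) << 21) | (displacement << 1) | 1


def long_branch_target(old_pc: int, instruction: int) -> int:
    """Apply the LBR or LBSR target computation, selected by bit 0."""
    if (instruction >> 26) & 0x3F != 0:
        raise ValueError("not a long branch: opcode field is not 000000")
    if instruction & 1 == 0:
        # LBR: signed 25-bit displacement scaled by the 4-byte word length.
        displacement = sign_extend((instruction >> 1) & 0x1FFFFFF, 25)
        return (old_pc + 4) + 4 * displacement
    # LBSR: unsigned 20-bit displacement scaled by 32.
    displacement = (instruction >> 1) & 0xFFFFF
    return (old_pc + 4) + 32 * displacement
```

The largest LBR displacement reaches 64 MB forward of the branch (`long_branch_target(0, encode_lbr((1 << 24) - 1))` is 64 * 1024 * 1024), while the largest LBSR displacement reaches just under 32 MB forward and can never point backward.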

This computation procedure has two implications. 
First, LBSR instructions can only be used to branch 
from an image code section to an image's patch sec- 
tion. Second, branches into the patch section are 
either BR or BSR instructions (or their long displace- 
ment counterparts). PatchWrx uses only BR or LBR 
instructions to return from the patch section to the 
original branch target within a code section; BSR and 
LBSR instructions are never used. Therefore, restrict- 
ing LBSR instructions to use positive displacements 
does not present a problem. 

The LBSR displacement multiplier value of 32 does 
present some restrictions, however. The multiplier 
value of 4 used in the original Alpha instruction set
architecture represents the instruction word length 
of 4 bytes. Thus, normal branch instruction target 
addresses must be aligned on a 4-byte boundary. By 
using the multiplier value of 32 for LBSR instructions, 
LBSR target addresses are restricted to align on a 32- 
byte (i.e., eight-instruction) boundary. Since all LBSR 
targets reside within the patch section, this restriction 
does not pose a problem. If an LBSR is to be inserted 
into the image code section and the next available 
patch target address is not aligned properly, PatchWrx 
can insert no operation (NOP) instruction words and 
advance the next available patch target address until 
the necessary alignment is achieved. PatchWrx never 
executes the NOPs; they are inserted for alignment 
purposes only. Although inserting these NOP instruc- 
tions increases the image size, we have implemented 
several optimizations into the instrumentation algo- 
rithm to minimize this increase. For example, a queue 
is used to hold LBSRs that do not align. As LBR 
patches are committed, PatchWrx probes the queue to 
determine if any LBSRs align from their origin to the 
newly available patch target offset. 
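
The alignment rule and the queue optimization can be sketched as follows. This is a simplified model under stated assumptions, not the PatchWrx algorithm itself: patch requests are hypothetical `(kind, origin, size_in_bytes)` tuples, the queue is probed after every committed patch rather than only after LBR patches, range limits are ignored, and any LBSR still waiting at the end is placed with NOP padding.

```python
from collections import deque

INSTRUCTION_BYTES = 4   # Alpha instruction word length
LBSR_SCALE = 32         # LBSR displacement multiplier


def nops_needed(origin, next_patch):
    """NOP words required before next_patch is reachable by an LBSR at origin.

    The LBSR target is (origin + 4) + 32 * d, so the patch address must be
    congruent to origin + 4 modulo 32.
    """
    gap = (origin + INSTRUCTION_BYTES - next_patch) % LBSR_SCALE
    return gap // INSTRUCTION_BYTES


def instrument(requests, patch_base):
    """Assign a patch address to each request; return (placements, nop_count)."""
    next_patch = patch_base
    waiting = deque()      # LBSRs (origin, size) that did not align when first seen
    placements = {}
    nop_count = 0

    def commit(origin, size):
        nonlocal next_patch
        placements[origin] = next_patch
        next_patch += size

    def drain_aligned():
        # Re-probe after every commit, because each commit moves next_patch.
        found = True
        while found:
            found = False
            for entry in waiting:
                if nops_needed(entry[0], next_patch) == 0:
                    waiting.remove(entry)
                    commit(*entry)
                    found = True
                    break

    for kind, origin, size in requests:
        if kind == "LBSR" and nops_needed(origin, next_patch) != 0:
            waiting.append((origin, size))
            continue
        commit(origin, size)
        drain_aligned()

    while waiting:
        # No later patch happened to align these; fall back to NOP padding.
        origin, size = waiting.popleft()
        pad = nops_needed(origin, next_patch)
        nop_count += pad
        next_patch += pad * INSTRUCTION_BYTES
        commit(origin, size)
        drain_aligned()
    return placements, nop_count
```

In this model, an LBSR that would need six NOPs on its own can be placed with none if a 24-byte LBR patch is committed first, which is the kind of saving the queue is meant to capture.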



Trace Capture 

The PatchWrx toolset allows the user to turn tracing on 
and off and thus capture any portion of workload execu- 
tion. The tracing tool is also responsible for copying trace 
entries from the physical memory buffer to disk. Copying 
the trace buffer to disk is performed after tracing has 
stopped so that the time required to perform the copy 
does not introduce any overhead during trace capture. 

PatchWrx logs a trace entry for each patch encountered
during program execution. As execution proceeds through
the code section, it reaches an unconditional PatchWrx
branch. Instead of branching to the original target, the
patched branch transfers control to the image's patch
section. Within the patch section, a PatchWrx PALcall traps
to the PAL routine corresponding to the patch type and logs
a trace entry to the trace buffer. The PAL routine then
returns to the instruction following the CALL_PAL
instruction. PatchWrx uses an unconditional branch to
transfer control from the patch section back to the original
target within an image code section. During the execution
of the PatchWrx PAL routine, necessary machine state
information is recorded and logged in the trace buffer. This
allows for the capture of register contents, process ID
information, etc., which are used later during trace
reconstruction.

The trace capture facility captures the dynamic execu- 
tion of a workload running on the system. To recon- 
struct the trace after it has been captured, the tracing 
tool must also capture a snapshot of the base load 
addresses of all active images on the system. This snapshot
serves as the virtual address map used in reconstructing
the trace. Each active process and its associated
libraries are loaded into a separate address space, and an
image's actual load address may differ from the preferred
load address specified statically in the image header. If
each image were loaded into memory at its preferred base
address, the virtual address map would not be necessary to
perform reconstruction. Instead, PatchWrx could map target
addresses from the trace buffer using the base address
values contained in the static image headers.
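
A virtual address map of this kind can be sketched as a per-process table of base load addresses. The class below is a hypothetical illustration, not the PatchWrx data structure: the name `AddressMap`, the stored image size (used here only to bound each image), and the addresses in the usage note are all assumptions.

```python
import bisect


class AddressMap:
    """Snapshot of where each image is loaded, kept separately per process."""

    def __init__(self):
        self._images = {}   # process ID -> sorted list of (base, size, name)

    def add_image(self, pid, base, size, name):
        bisect.insort(self._images.setdefault(pid, []), (base, size, name))

    def resolve(self, pid, address):
        """Return (image name, offset within image) or None if unmapped."""
        images = self._images.get(pid, [])
        bases = [base for base, _, _ in images]
        index = bisect.bisect_right(bases, address) - 1
        if index >= 0:
            base, size, name = images[index]
            if address < base + size:
                return name, address - base
        return None
```

For example, after `m.add_image(42, 0x00400000, 0x10000, "winword.exe")`, the call `m.resolve(42, 0x00401234)` returns `("winword.exe", 0x1234)`; the offset is what a reconstruction tool would use to locate the instruction in the image file on disk.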

The type of trace record that PatchWrx logs into the 
trace buffer depends on the type of branch or low-level 
PAL function being traced. Figure 4 shows the trace 
record formats. The first three trace entry formats 
consist of an 8-bit opcode and a 24-bit time stamp. 
The time stamp is the low-order 24 bits of the CPU 
cycle counter. The 32-bit field of these three formats 
depends on the type of trace entry logged. The first 
format is used for target virtual addresses for all 
unconditional direct and indirect branches, jumps, 
calls, returns, interrupts, and returns from interrupts. 
The 32-bit field of the second format is used to record 
the base register value for traced load and store 
instructions and stack pointer values that are flushed 
into the trace buffer during system calls and returns. 
The 32-bit field of the third format is used for logging
the current active process ID at a context swap.

Format 1: OPCODE (8 bits), TIME STAMP (24 bits), TARGET PC (32 bits)
Format 2: OPCODE (8 bits), TIME STAMP (24 bits), BASE REGISTER VALUE (32 bits)
Format 3: OPCODE (8 bits), TIME STAMP (24 bits), NEW PROCESS ID (32 bits)
Format 4: OPCODE (3 bits), START BIT (1 bit), VECTOR OF 60 TAKEN/FALL-THROUGH TWO-WAY BRANCH BITS (60 bits)

Figure 4

Trace Entry Formats
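
The first three trace entry formats share one 64-bit layout, which the following Python sketch packs and unpacks. It assumes that the opcode occupies the most significant 8 bits, followed by the time stamp and then the 32-bit field; the figure gives the field widths but the bit ordering, the opcode values, and the names used here are assumptions made for illustration.

```python
from typing import NamedTuple


class TraceRecord(NamedTuple):
    opcode: int       # 8-bit record type
    time_stamp: int   # low-order 24 bits of the CPU cycle counter
    value: int        # 32-bit target PC, base register value, or process ID


def pack_record(opcode: int, cycle_counter: int, value: int) -> int:
    """Build a 64-bit record; only the low 24 bits of the cycle counter are kept."""
    return ((opcode & 0xFF) << 56) | ((cycle_counter & 0xFFFFFF) << 32) | (value & 0xFFFFFFFF)


def unpack_record(word: int) -> TraceRecord:
    return TraceRecord((word >> 56) & 0xFF, (word >> 32) & 0xFFFFFF, word & 0xFFFFFFFF)


def elapsed_cycles(earlier: int, later: int) -> int:
    """Cycles between two 24-bit time stamps, tolerating one counter wraparound."""
    return (later - earlier) & 0xFFFFFF
```

Because the time stamp keeps only 24 bits, differences between consecutive records are meaningful only when fewer than 2^24 cycles separate them; `elapsed_cycles(0xFFFFF0, 0x000010)` is 0x20 even though the counter wrapped in between.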



The fourth trace entry type is used for tracing con- 
ditional branches. It uses a 3-bit opcode and up to 60 
taken/fall-through bits. A start bit is used to deter- 
mine how many bits are active. The start bit is set to 
1 if a conditional branch is taken and to 0 if the branch 
is not taken. This recording scheme allows a compact 
encoding of conditional branch trace entries. During 
trace reconstruction, PatchWrx uses conditional branch
trace entries to reconstruct the correct instruction 
flow when conditional branches are encountered and 
to provide concise information about when to deliver 
interrupts in loops. 

Trace Reconstruction 

The reconstruction phase is the final step in generating 
a full instruction stream of traced system activity. As 
shown in Figure 5, trace reconstruction requires sev- 
eral resources in order to generate an accurate instruc- 
tion stream of all traced system activity. 

Trace reconstruction reads and initializes the head- 
ing of the captured trace, which includes a time stamp, 
the name of the user who captured the trace, and any 
important system configuration information, e.g., the 
operating system version number. Next, reconstruc- 
tion reads the first four raw trace records, which are 
automatically entered whenever tracing is turned on. 
These records contain the first target virtual address, 
the active process ID, the value of the stack pointer, 
and the first taken/fall-through record to be used 
(such records always precede the branches they repre- 
sent). PatchWrx uses this information to initialize the 
necessary data structures of the reconstruction process. 



Using the first target virtual address and process ID 
pair from the captured trace, trace reconstruction con- 
sults the virtual address map to determine in which 
image the instruction falls (based on its dynamic base 
load address) and where that image is physically 
located on the system. The tool consults the patched 
image to determine the actual instruction at the target 
address, records this instruction, and then reads the 
next instruction from the patched image. This process 
continues until reconstruction encounters either a 
conditional branch or an unconditional branch. A 
conditional branch causes the tool to check the first 
active bit of the current taken/fall-through entry to 
determine subsequent control flow; the process then 
continues at that address. If an unconditional branch is
encountered, reconstruction records the entry and
checks it against the next captured trace entry. If the
two entries match, the tool outputs the recorded
instructions to an instruction stream file, consults the
captured trace entry for the next target instruction
virtual address, and repeats the procedure until the entire
captured trace has been processed. 
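
The walk described above can be modeled with a toy example. The sketch below is a deliberately simplified illustration and makes several assumptions not taken from the paper: the image is a hypothetical dictionary mapping each address to a `(kind, operand)` pair, "matching" an unconditional branch against a trace entry means only that the branch kinds agree, and interrupts and the virtual address map are left out.

```python
def reconstruct(image, start, target_entries, taken_bits):
    """Return the list of executed instruction addresses.

    image: {address: (kind, operand)} where kind is "alu", "cond" (operand is
        the taken target), or an unconditional kind such as "jump" or "return".
    target_entries: captured (kind, target_address) records, in trace order.
    taken_bits: captured taken/fall-through bits, in trace order.
    """
    stream = []
    targets = iter(target_entries)
    bits = iter(taken_bits)
    pc = start
    while True:
        kind, operand = image[pc]
        stream.append(pc)
        if kind == "alu":
            pc += 4
        elif kind == "cond":
            taken = next(bits, None)
            if taken is None:
                raise ValueError("ran out of taken/fall-through bits")
            pc = operand if taken else pc + 4
        else:
            entry = next(targets, None)
            if entry is None:
                return stream          # captured trace fully processed
            entry_kind, entry_target = entry
            if entry_kind != kind:
                raise ValueError("trace entry does not match instruction at %#x" % pc)
            pc = entry_target
```

Note that the static image alone determines the path between captured entries; the trace only has to supply one bit per conditional branch and one target per unconditional transfer, which is why the captured trace is much smaller than the reconstructed instruction stream.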

Since PatchWrx captures interrupts and other low-level
system activities (e.g., page faults) in the trace,
these activities must also be reconstructed. When
PatchWrx logs an interrupt into the trace buffer, the
corresponding target virtual address in the captured
record represents the address of the first instruction
not executed when the interrupt was taken. PatchWrx
flushes the currently active taken/fall-through entry
to the memory buffer and initializes a new
taken/fall-through entry. This new entry will be responsible
for the conditional branches encountered beginning with
the interrupt service routine. The address of the first
instruction within the interrupt service routine is then
logged in the trace.

Figure 5

Instruction Stream Reconstruction Resources

During reconstruction, the reconstruction tool looks 
for the interrupt's first unexecuted instruction address
to know which instruction to stop at when reconstructing
the instruction stream. The tool then begins
reconstructing the instruction stream, including the
interrupt handler stream. If the unexecuted instruction
is within a loop, trace reconstruction utilizes the
taken/fall-through entry convention. On taking the 
interrupt, the active taken/fall-through record is flushed 
and another record is started. This process allows the 
tool to continue to reconstruct iterations of the loop 
until all the taken/fall-through bits are exhausted. 

Operating System-Rich Workload 
Characterization 

As presented in the study by Lee et al., desktop
applications and benchmarks share some workload
characteristics, but applications alone do not represent full
system behavior. To investigate and address system 
design issues, computer architects should use operat- 
ing system-rich traces. 

To illustrate this point, we present a sample of the 
various workload characteristics that exist in a set of 
benchmark and desktop applications specially selected 
to study the differences in the use of the operating sys- 
tem and related services. The first characteristic we dis- 
cuss is the amount of time each benchmark or desktop 
application spends within three domains: 

1. Application-only domain (e.g., winword.exe and 
excel.exe) 



2. DLL domain — Win32 user (e.g., kernel32.dll, 
user32.dll, and ntdll.dll) 

3. Operating system domain — Win32 kernel, kernel, 
system processes, system idle process (e.g., 
Win32K.sys, ntoskrnl.exe, drivers, and the spooler) 

Examining these times provides insight into a work- 
load's use of each domain. We also examine DLL and 
system service usage on an image basis for each work- 
load. This breakdown helps us more clearly identify the 
dependence between the workload and the system ser- 
vices provided by the Windows NT operating system. 

We also present the instruction mix of each workload 
with and without the inclusion of the operating system 
execution. Understanding the differences in instruc- 
tion composition in the presence of system activity fur- 
ther highlights the behavior lacking in application-only 
traces, such as increases in branch and memory instruc- 
tions, when compared to application-only workloads. 
We present the average basic block lengths for each 
domain of execution (application-only, DLL, operating 
system) separately and then in combination. This met- 
ric reveals which workload domain dominates the 
branching behavior. Casmira's work provides a more 
complete description of these differences across a wider 
set of workload characteristics.

Workload Descriptions 

We performed all the experiments reported on in this
paper on a DIGITAL Alpha platform running the
Microsoft Windows NT version 4.0 operating system.
We captured the traces on a 150-megahertz Alpha
21064 processor. The system configuration included
80 MB of physical memory. Table 3 lists the workloads
we examined.

Table 3

Workload Description

Workload   Description
fourier    BYTEmark benchmark; a numerical analysis routine for calculating series approximations of waveforms
neural     BYTEmark benchmark; a small, functional back-propagation network simulator
go         SPEC95 Go! game benchmark
li         SPEC95 Lisp interpreter benchmark
cdplay     Microsoft CD Player playing a music CD
fx!32      DIGITAL FX!32 V1.1 interpreting/translating included OpenGL sample x86 application
ie         Microsoft Internet Explorer V2.0 following a series of web page links
vc50       Microsoft Visual C/C++ V5.0 compiling a 3,000-line C program
word       Microsoft Word97 V7.0, spell-checking a 15-page document



The fourier and neural workloads are from the 
BYTEmark benchmark test suite: the neural workload 
is a small array-based floating-point test; the fourier
workload is designed to measure transcendental and 
trigonometric floating-point unit performance. 

The go and li workloads are from the SPEC95 integer 
benchmark suite: the go workload is a simulation of the 
game Go!, with the computer playing against itself; the li
workload is a Lisp interpreter. All the workloads use the
standard inputs provided with the benchmarks and are
compiled with the default optimization level using the
native Alpha version of Microsoft C/C++ version 5.0. 

The cdplay workload is the Microsoft CD Player 
application included in Microsoft Windows NT ver- 
sion 4.0. The device was traced while playing a music 
CD using default playing options (e.g., playing all the 
songs in order). 

The fx!32 workload is the DIGITAL FX!32 version 1.1
emulator/translator provided by Compaq's DIGITAL 
Alpha Migration Tools Group. We ran the robot arm
OpenGL sample Intel-based application in the fore- 
ground during trace capture. 

The ie workload is the standard Microsoft Internet 
Explorer version 2.0 workload included in Microsoft 
Windows NT version 4.0. The ie workload was traced 
while traversing four links through the Sony home 
web page, arriving finally at the Sony PlayStation Store 
web page. The trace was captured on May 4, 1998; 
pages may have changed since this date. The history 
cache and the web link cache were both empty when 
the trace was captured. 

The vc50 workload is the Microsoft C/C++ version 
5.0 compiler compiling a 3,000-line C source code file.
We used the command line interface, and we used the
default optimization levels and other parameters, which
best represented the common usage of the compiler. 

The word workload is Microsoft Word from the 
Microsoft Office97 desktop application suite for the 
Alpha processor used to capture a manual spell check 
of a 15-page Microsoft Word document. The standard
Microsoft Word dictionary was employed. 



To provide a clear and representative comparison 
of workload behavior, we captured several traces. For 
all scenarios, full traces of each workload captured 
approximately 5 to 10 seconds of execution, filling the 
45-MB trace buffer. To characterize workload behavior,
each experiment was run with the benchmark or
application as the only activity on the system. Each 
workload was run in the foreground. 

To ensure that the traces captured were representa- 
tive of the overall workload behavior, we captured 
multiple traces. We chose different points during exe- 
cution for tracing to allow comparison between differ- 
ent portions of the selected scenarios. To investigate 
the variability present in selected workloads, we traced 
additional scenarios. A second Microsoft Word trace 
was captured with the application performing an auto- 
format operation of the same document used in the 
first trace of the spell-check operation, and we cap- 
tured a second Microsoft Internet Explorer trace, 
repeating the Sony links but with the links cached. We 
captured a second trace of FX!32 using the included 
boggle sample game (for comparison against using the 
OpenGL application input). Additionally, the FX!32
translator was traced while it optimized a native Intel
x86 application's profile. To condense the number of
memory pages occupied by an image, Microsoft
designed the new linker to allow data to reside within
the code regions. Hookway and Herdeg provide an
explanation of the DIGITAL FX!32 emulation and
translation/optimization procedures. Casmira discusses
these scenarios and others.

Domain Mix 

To illustrate the inherent differences between benchmark
and desktop application behavior, we break
down the captured trace in terms of three mutually
exclusive domains. These domains are (1) application,
(2) DLL, and (3) operating system. The application
domain represents the set of executed instructions that
are within the traced application's executable image.
The DLL domain represents the instructions executed
by the process of the application of interest but excludes
the application's executable image. This domain is
made up of the DLLs, system services, and drivers that
the application may access during execution. The
operating system domain includes instructions executed
by the kernel or other system support service
executable images, and all associated DLL and driver
images. These are the processes, images, and libraries
that are always present and running on the system.
Figure 6 displays the breakdown of instructions into
these three domains. The x-axis lists the workloads,
and the y-axis presents the percent composition of the
captured trace. Note that the four benchmarks (fourier,
neural, go, and li) spend at least 95 percent of their
execution within their application image. Both the
fourier and the neural benchmarks spend about
99 percent of their execution within their application
image. The go and li benchmarks do exhibit some
operating system activity, but this activity is due to the
I/O generated as go displays output as it progresses
and as li reads input from its standard input file.
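
The three-way breakdown can be expressed as a small classification routine. The sketch below is one simplified reading of the domain definitions above, not the authors' analysis tool: it assumes that each reconstructed sample carries a process ID, an image name, and an instruction count, and that any instruction executed outside the traced application's process belongs to the operating system domain. All names and the sample data are hypothetical.

```python
from collections import Counter


def classify(pid, image, app_pid, app_image):
    """Assign one sample to the application, DLL, or operating system domain."""
    if pid != app_pid:
        return "operating system"
    if image.lower() == app_image.lower():
        return "application"
    return "DLL"


def domain_mix(samples, app_pid, app_image):
    """samples: iterable of (pid, image name, instruction count).

    Returns the percent composition of the trace by domain.
    """
    counts = Counter()
    for pid, image, instructions in samples:
        counts[classify(pid, image, app_pid, app_image)] += instructions
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: 100.0 * count / total for domain, count in counts.items()}
```

For instance, a trace with 950 instructions in `go.exe`, 20 in a DLL within the same process, and 30 in another process yields 95 percent application, 2 percent DLL, and 3 percent operating system.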

The operating system dominates the execution in
the cdplay workload. The Microsoft CD Player application
is I/O bound, relying heavily on the necessary
services provided by the operating system and the
DLLs to access the CD hardware. While waiting for
I/Os to complete, the system activity is composed
almost completely of the kernel idle loop performing
busy waiting (recall that each workload investigated is
the only application running on the system, so there is
no other work to be done during these periods).

The fx!32 workload spends nearly all its execution 
time operating within DLLs. The robot arm Intel x86 
OpenGL sample that the DIGITAL FX!32 application
is interpreting heavily exercises the graphics display 
libraries and console display services. 

The ie workload is more evenly distributed across 
the three domains. The moderate amount of operating 
system activity is due to the network and screen display 
I/O and also to the Microsoft Internet Explorer's 
caching of the pages it touches to local disk files. The 
DLL activity is generated by operating system services 
for screen and file I/O and by network service library 
routines. The application image coordinates the usage
of these routines, and network and display I/O, which 
is frequently encountered during the operations of 
selecting and opening web links. This coordination 
accounts for the high percentage of application domain 
execution exhibited by ie, as shown in Figure 6. 

The vc50 workload spends nearly all its execution
time within its application image. This phase of the
compiler is responsible for performing the parsing and
lexical analysis of the source code file. There is some
use of DLLs through invoking library routines to load
included header files. The operating system activity,
although small, is present; all I/O must be accessed by
means of a system service.

The Microsoft Word spell-checking service is provided
by means of a DLL included with the application.
Thus for the word workload, this DLL handles both the
search through the document and the successive dictionary
lookups. Operating system services are required for
accessing portions of the file residing on disk (not in
memory pages), for displaying the search and compare
results to the user, and for performing the user-driven
I/O associated with accepting/rejecting word replacement
choices (prompted by the spell-checking tool).

Figure 6 shows the consistent pattern of instruction
domains that the four benchmarks follow in contrast to
the variability in the instruction domain mix of the desktop
application workloads. Even though there is slight operating
system activity for go and li (attributable to I/O
services), the benchmarks spend practically all their execution
within their application images; no DLL use is visible.
Clearly these benchmarks do not utilize system services to
the level observed in the commercial desktop workloads.
With the exception of the CD player, the commercial
desktop applications examined use DLLs more heavily
than they do operating system services. This is especially
true in the fx!32 and word workloads, which carry out the
tasks captured in the trace by means of DLL routines.

Characterization of Image Usage 

To investigate the domains present in the trace at the 
image level, we identified the top five most heavily 
used images, based on the number of instructions exe- 
cuted in each image. First, an explanation of some of 
the more frequently used system executables and 
DLLs is in order. Table 4 lists the names of the commonly
used images and a brief description of each.

We present the image usage of the nine traces. This
characterization includes all the images (e.g., executables,
DLLs, services, and drivers) listed in Table 5. The
data helps demonstrate several points. First, commercial
desktop workloads spend a lot more time in DLLs than
benchmarks do. Consequently, we can project that the
number of procedure calls in desktop applications will
be higher than the number of calls in benchmarks.
Second, real applications depend not only on system
DLLs but also on their local DLLs. We see this behavior
explicitly with the Microsoft Word application.

Instruction Mix 

Although understanding the domain mix and image 
usage helps identify differences between benchmarks 
and desktop applications, we would like to look deeper 
within each domain to see inherent differences that 
affect design decisions. Figure 7 shows the application-only
instruction mix (i.e., the instruction mix for only
the application and application-specific DLLs) for each 
workload. Each entry in the legend represents a class 
of instructions found within the application domain. 
The y-axis denotes the percent composition of the 
trace; the workloads are displayed on the x-axis. 

Note that the instruction mix for the fx!32 workload 
is zero. This value is a result of the lack of execution 
within the application image itself. Referring back to 
Table 5 and the domain instruction mix, note that 
nearly all the workload execution is within DLLs (some 
execution is within ntoskrnl.exe). The remaining work- 
loads consist mainly of load, store, conditional branch, 
and arithmetic and logic unit (ALU) logic operations. 
No overriding characteristic differentiates benchmarks 
and desktop applications. Note the significant variabil- 
ity in the instruction mix among the different bench- 
marks and among the different desktop applications. 

Figure 8 shows the instruction mix of the entire
trace. The first and most noticeable difference between
the application domain and full-trace instruction mix
figures is the increase in instruction types present in
the trace. Nine instruction classes were present in the
application domain instruction mixes, while 17 are
present in the full-system traces. Worth noting is the
presence of six CALL_PAL instruction types (all use the
same opcode, but invoke 6 different PAL routines)
in the full traces. Since each executed CALL_PAL
instruction causes a trap that takes on the order of tens
of cycles to complete, we can conclude that this is a
significant insight into the system's inherent run-time
latency, not visible with application-only workloads.

Table 4

Common System Images

Name           Description
ntoskrnl.exe   Windows NT operating system kernel core
hal.dll        Hardware Abstraction Library (HAL), which is responsible for the underlying hardware interface
kernel32.dll   Main kernel library
win32k.sys     Kernel-mode device driver
gdi32.dll      Graphics display interface library
ntdll.dll      Library routines provided to each client process on the Windows NT system
MSVCRT.dll     Microsoft C/C++ run-time library
s3.dll         Graphics adapter library for the test platform
qv.dll         Graphics adapter library for the test platform

Table 5

The Five Most Frequently Used Images in Each Application or Benchmark

Workload   Image Name (Percentage of Total Number of Instructions Executed within the Image)
fourier    bytecpu.exe (99.5%)    winsrv.dll (0.2%)      win32k.sys (0.1%)      ntoskrnl.exe (0.1%)   user32.dll (0.02%)   Other (0.08%)
neural     bytecpu.exe (99.7%)    winsrv.dll (0.2%)      ntoskrnl.exe (0.03%)   win32k.sys (0.03%)    ntdll.dll (0.02%)    Other (0.02%)
go         go.exe (95.5%)         win32k.sys (2.0%)      ntoskrnl.exe (1.0%)    hal.dll (0.4%)        qv.dll (0.1%)        Other (1.0%)
li         li.exe (97.7%)         win32k.sys (1.0%)      ntoskrnl.exe (0.6%)    user32.dll (0.1%)     qv.dll (0.1%)        Other (0.5%)
cdplay     ntoskrnl.exe (81.8%)   hal.dll (14.7%)        win32k.sys (1.1%)      tcpip.sys (0.4%)      winsrv.dll (0.3%)    Other (1.7%)
fx!32      hal.dll (42.5%)        s3.dll (24.6%)         OPENGL32.DLL (12.2%)   MSVCRT.dll (11.7%)    GLU32.dll (2.7%)     Other (6.3%)
ie         iexplore.exe (37.2%)   win32k.sys (19.3%)     ntoskrnl.exe (17.5%)   Fastfat.sys (6.1%)    ntdll.dll (6.0%)     Other (13.9%)
vc50       cl.exe (83.1%)         ntoskrnl.exe (10.5%)   MSVCRT.dll (2.8%)      Ntfs.sys (1.2%)       win32k.sys (1.1%)    Other (1.3%)
word       MSSP232.DLL (36.4%)    MSGREN32.DLL (34.0%)   ntoskrnl.exe (10.2%)   win32k.sys (7.7%)     hal.dll (4.0%)       Other (7.7%)

Next note the striking similarities in instruction 
mix for the four benchmarks in Figures 7 and 8. 
Benchmarks do not interact with the operating system 
in any significant manner. The desktop application 
workloads, however, show significant differences 
between the application domain and the complete 
trace instruction mixes. 

The number of store instructions for the cdplay 
workload decreases from about 11 percent to approximately
1 percent. The number of BSR instructions
increases from 1 percent to about 6 percent. Most 
interesting for this application is the decrease in the 
number of ALU operations from almost 30 percent to 
about 2 percent, while the number of CALL_PAL 
instructions increases from 0 to 21 percent. Referring to 
Figure 6, the domain execution mix plots clearly show 
why the differences for this workload are so large when 
the system activity is included — more than 95 percent 
of the workload trace is operating system execution. 

Considering the latency incurred by executing 
CALL_PAL instructions, clearly an optimization that 
concentrates on improving ALU operations based on 
the application domain instruction mixes would have a 
much smaller impact on the true system performance. 
The measured difference in instruction mix under- 
scores the importance not only of using real workloads 
for trace-driven simulations but also of including the 
operating system behavior in order to see the full picture. 

The fx!32 complete trace instruction mix is, of
course, completely different from the application
instruction mix of Figure 7, in which no instructions
were executed within the fx!32 application image. Both
the ie and the word workloads introduce CALL_PAL
instructions when including the operating system. The
ie instruction mix shows an increase in jumps, calls, and
returns, which most likely reflects the increase in
subroutine calls for system services. The word instruction
mix experiences a reduction in load instructions from
approximately 52 percent to 35 percent. This decrease
can be attributed to the increase in ALU operations
present when operating system activity is included.

The results presented in Figures 7 and 8 reinforce 
the points that benchmarks do not represent true desktop
workloads and that the desktop workloads display
significantly different characteristics when viewed in the 
presence of system activity. 

Average Basic Block Length 

Including the operating system activity in our traces yields 
an overall increase in the percentage of control flow 
instructions present. Figure 9 shows a consequence of 
this fact. In this figure, we present the average basic block 
length for each workload, on a per-domain basis. The
ALL bar is the average basic block length across all 
domains; OS denotes the operating system instructions 
only; DLL denotes the workload's DLL instructions 
only; APPDLL denotes the combined application and 
DLL instructions; and APP denotes the application
instructions only. 

Inspecting the four benchmarks, we notice little dif- 
ference between the application-only basic block 
length and the overall basic block length. Referring to 
our domain instruction mix figure, recall that the 
benchmarks spend about 95 percent of their execution 
within their executable images. Therefore, including
any operating system activity into a basic block length 
average has a minimal effect. 

However, considering the large amount of operat- 
ing system execution present in the cdplay trace, the 
overall basic block length is significantly less than the 
application-only length. The overall and operating 
system length values are almost the same. Not only 
does including the system activity in the trace influ- 
ence the overall basic block length but the amount 
of system activity determines to what degree the length 
is affected. 

In a similar fashion, the overall basic block length of 
the fx!32 trace tracks that of its DLLs. The length is 
directly proportional to the amount of time the workload
spends in its DLL domain. The execution of the ie
workload is more evenly distributed among the three 
domains, which affects the overall basic block length, 
producing a more evenly weighted average of all its 
domain basic block lengths (no one domain dominates). 



The vc50 workload spends a significant amount of 
time within its own executable image, which leads to 
an overall average basic block length similar to the 
application-only value. The word workload is similar, 
but the DLL behavior dominates. The cdplay and ie 
workloads experience a 50 percent decrease in average 
basic block length. This decrease can be attributed to 
an increase in the number of branches in the presence 
of operating system activity. With this increase in con- 
trol flow instructions, we expect increased pressure to 
be placed upon the branch prediction hardware. 

As observed in other characteristic categories, the 
four benchmarks do not exhibit noticeable deviations 
from application-only behavior when the operating 
system activity is introduced. Again this explains why 
simulation results using benchmark traces usually track 
the actual performance when the benchmarks are run 
on the real system. In contrast, four of the five desktop
applications exhibit significantly different behavior in 
the presence of the operating system. 
Summary

In this paper we described the PatchWrx toolset. We
compared it to existing tools and demonstrated the 
need for operating system-rich traces by showing the 
amount of the total execution spent in the kernel and 
the DLLs. In addition, we showed that existing desk- 
top benchmarks do not exercise the kernel and the 
DLL sufficiently to provide meaningful indicators of 
desktop performance. 

These results have reinforced our argument that 
researchers need to use traces with both application 
and operating system information, especially as new 
applications spend more time executing within the 
operating system. The goal is for computer architects 
to use operating system-rich traces of applications that 
dominate the desktop market. 

We have recently finished modifications to the PAL 
to enable PatchWrx to run on the Alpha 21164 plat- 
form. We plan to study a wider range of desktop appli- 
cations, including database and server applications. 
Future plans also include migrating the toolset to the 
Windows 2000 operating system. 
Automatic Template Instantiation in DIGITAL C++



Automatic template instantiation in DIGITAL C++ 
version 6.0 employs a compile-time scheme that 
generates instantiation object files into a reposi- 
tory. This paper provides an overview of the C++ 
template facility and the template instantiation 
process, including manual and automatic instan- 
tiation techniques. It reviews the features of 
template instantiation in DIGITAL C++ and 
focuses on the development and implemen- 
tation of automatic template instantiation in 
DIGITAL C++ version 6.0. 



The template facility within the C++ language allows 
the user to provide a template for a class or function 
and then apply specific arguments to the template 
to specify a type or function. The process of applying 
arguments to a template, referred to as template instan- 
tiation, causes specific code to be generated to imple- 
ment the functions and static data members of the 
instantiated template as needed by the program. 
Automatic template instantiation relieves the user of 
determining which template entities need to be instan- 
tiated and where they should be instantiated. 

In this paper, we review the C++ template facility and
describe approaches to implementing automatic template
instantiation. We follow that with a discussion of
the facilities, rationale, and experience of the DIGITAL
C++ automatic template instantiation support. We 
then describe the design of the DIGITAL C++ version 
6.0 automatic template instantiation facility and indi- 
cate areas to be explored for further improvement. 

C++ Template Facility 

The C++ language provides a template facility that 
allows the user to create a family of classes or functions 
that are parameterized by type.' ' For example, a user 
may provide a Stack template, which defines a stack 
class for its argument type. Consider the following 
template declaration: 

template <class T> class Stack {
    T *top_of_stack;
public:
    void push(T arg);
    void pop(T &arg);
};

The act of applying the arguments to the template 
is referred to as template instantiation. An instantia- 
tion of a template creates a new type or function that 
is defined for the specified types. Stack<int> creates
a class that provides a stack of the type int. 
Stack<user_class> creates a class that provides a stack 
of user_class. The types int and user_class are the argu- 
ments for the template Stack. 
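
As a minimal sketch of what instantiation produces, the listing below fills in the Stack template so that it can be compiled and used. The array-backed storage, the capacity of 16, the constructor, and the two helper functions are assumptions added for illustration; the declaration in the text shows only top_of_stack, push, and pop.

```cpp
#include <cassert>

// Assumption for illustration: an array-backed store with a fixed capacity.
// The declaration in the text shows only top_of_stack, push, and pop.
template <class T> class Stack {
    enum { capacity = 16 };   // hypothetical limit
    T storage[capacity];
    T *top_of_stack;          // points one past the most recently pushed element
public:
    Stack() : top_of_stack(storage) {}
    void push(T arg) {
        assert(top_of_stack != storage + capacity);  // stack must not be full
        *top_of_stack++ = arg;
    }
    void pop(T &arg) {
        assert(top_of_stack != storage);             // stack must not be empty
        arg = *--top_of_stack;
    }
};

struct user_class { int id; };  // stand-in for the user_class named in the text

// Using Stack<int> here causes the compiler to instantiate that class
// and the two member functions that are actually called.
int last_int_pushed() {
    Stack<int> s;
    s.push(1);
    s.push(2);
    int v = 0;
    s.pop(v);
    return v;
}

// Stack<user_class> is a second, distinct class generated from the same template.
int last_user_id_pushed() {
    Stack<user_class> s;
    user_class u = { 7 };
    s.push(u);
    user_class out = { 0 };
    s.pop(out);
    return out.id;
}
```

Each helper returns the value most recently pushed, so the two instantiations can be exercised independently of each other.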
In general, a template needs to be instantiated when
it is referenced. When a class template is instantiated, 
only those member functions and static data members 
that are referenced are also instantiated. In the Stack 
example, the member function push of the class
Stack<int> needs to be instantiated only if it is used. 
Template functions and static data members have 
global scope; therefore, only one instantiation of each 
should be in a user's application. Since source files are 
compiled separately and combined later at link time to 
produce an executable, the compiler alone is not able 
to ensure that one and only one instance of a specific 
template is efficiently generated for any given exe- 
cutable. That is, the compiler by itself is not able to 
know whether the function or variable definition for a 
specific template is satisfied by code generated in 
another object module. 

The C++ Standard provides facilities for the user to 
specify where a template entity should be instantiated.'
When the user explicitly specifies template instantia- 
tion, the user then becomes responsible for ensuring 
that there is only one instantiation of the template 
function or static data member per application. This 
responsibility can necessitate a considerable amount of 
work. However, the compiler and linker working 
together can provide effective template instantiation 
without specific user direction. 

In the following section, we present the various 
approaches that can be used for template instantiation. 

Template Instantiation Techniques 

Template instantiation techniques can be broadly cat- 
egorized as either manual or automatic. With manual 
instantiation, the compilation system responds to user 
directives to instantiate template entities. These direc- 
tives can be in the source program, or they may be 
command-line options. With automatic instantiation, 
the compilation system, including the linker, decides 
which instantiations are required and attempts to pro- 
vide them for the user's application. 

Manual Instantiation 

Manual template instantiation is the act of manually
specifying that a template should be instantiated in the 
file that is being compiled. This instantiation is given 
global external linkage, so that references to the 
instantiation that are made in other files resolve to this
template instantiation. Manual template instantiation 
includes explicit instantiation requests and pragmas as 
well as command-line options. 

Explicit Instantiation Requests and Pragmas The 

compilation system instantiates those template entities 
that the user specifies for instantiation. The specification 
can be made using the C++ explicit template instantia- 
tion syntax or may be made using implementation- 



defined directives or pragmas. Since instantiations are 
given global external linkage, the user must ensure 
that the specified template instantiations appear only 
once throughout all the modules that compose the 
program. When only this mode of instantiation is 
used, the user also must ensure that all required tem- 
plate instantiations are specified to avoid unresolved 
symbols at link time. 
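
The standard explicit instantiation syntax mentioned above can be sketched as follows. The Counter class template and the twice function template are hypothetical names introduced only for this example; the two lines that begin with the keyword template followed directly by a declaration are the explicit instantiation requests.

```cpp
// Hypothetical templates used only to demonstrate the syntax.
template <class T> class Counter {
    T total;
public:
    Counter() : total(0) {}
    void add(T v) { total = total + v; }
    T value() const { return total; }
};

template <class T> T twice(T v) { return v + v; }

// Explicit instantiation requests. The instantiations are emitted in this
// translation unit with external linkage, so the user must make sure that
// no other module of the program requests the same ones.
template class Counter<int>;      // instantiates every member of Counter<int>
template int twice<int>(int);     // instantiates one function template specialization

int counter_demo() {
    Counter<int> c;
    c.add(40);
    c.add(2);
    return c.value();
}
```

Requesting a whole class, as in the first request, instantiates all of its members whether or not the program uses them, which is the trade-off the text describes for manual techniques.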

Command-line Instantiation Command-line options 
can be used to specify template instantiation. They are
similar in operation to the explicit instantiation requests, 
except they indicate groups of templates that should be 
instantiated, rather than naming specific templates to be 
instantiated. The command-line options include 

■ Instantiate All Templates. A command-line option 
can direct the compiler to instantiate all template 
entities whose definitions are known during compi- 
lation and whose argument types are specified. This 
has the advantage of specifying many template 
instantiations at once. The user must still ensure 
that no template instantiation happens more than 
once in the program and that all required instantia- 
tions are satisfied. Due to these requirements, the 
user cannot usually specify this option on more than 
one source-file compilation in the program. This 
option can also cause the instantiation of templates 
that are not used by the program. 

■ Instantiate Used Templates. A command-line option 
can be used to direct the compiler to instantiate 
only those template entities that are used by the 
source code and whose definitions are known at 
compilation. As in the previous technique, the user 
must ensure that no template instantiation happens 
more than once in the program and that all required 
instantiations are satisfied. Due to these require- 
ments, the user cannot usually specify this option 
on more than one source-file compilation in the 
program. 

■ Instantiate Used Templates Locally. This command- 
line option works like the instantiate used templates 
option, except that it defines each template instan- 
tiation locally in the current compilation. This option
has the advantage of providing complete template 
instantiation coverage for the program, as long as 
the definitions of the used templates are available in 
each module. Since all template instantiations are 
given local scope, there is no potential problem 
with multiply defined instantiations when the 
program is linked. The major problem with this 
technique is that the user's application can be 
unnecessarily large, since the same template instan- 
tiations could appear within multiple object files 
used to link the application. This technique will fail 
if the instantiations must have global scope such as 
a class's static data members. 
Figure 1 shows an example of a template function,
template_func, that contains a locally defined static 
variable. As shown in the figure, the object files of both 
A and B contain local copies of template_func instanti- 
ated with int. Each instance of template_func<int> 
defines its own version of static variable x. In this case, 
directing the compiler to instantiate used templates 
locally yields a different result than instantiating all or 
used templates globally. 

If we give the static data members global scope and 
ensure that they are properly defined and initialized by 
executable code rather than by static initialization, we 
can solve the static data members problem. The appli- 
cation, however, remains unnecessarily large, because 
multiple copies of the instantiated templates can be 
present in the executable. 
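
The difference in results can be reproduced in a single translation unit with a small sketch. The extra Owner template parameter and the tag types FileA, FileB, and Shared are assumptions made for illustration: each distinct Owner yields a separate instantiation with its own static variable x, which imitates an object file holding its own local copy. The function returns the value that Figure 1 prints instead of writing it to cout.

```cpp
// Tag types standing in for "which object file owns this copy" (assumed
// for illustration; a real build would use separate source files).
struct FileA {};
struct FileB {};
struct Shared {};

template <class Owner, class T> T template_func(T p) {
    static T x = 0;      // one x per distinct instantiation
    T result = x + p;    // the value Figure 1 writes to cout
    x++;
    return result;
}

// One global instantiation: the call from A.cxx and the call from B.cxx
// share x, so the second call sees x == 1.
int second_call_with_global_instantiation() {
    template_func<Shared>(10);           // A.cxx: yields 0 + 10
    return template_func<Shared>(20);    // B.cxx: yields 1 + 20
}

// Local instantiations: each file has its own x, so the second call
// still sees x == 0.
int second_call_with_local_instantiations() {
    template_func<FileA>(10);            // A.cxx: yields 0 + 10
    return template_func<FileB>(20);     // B.cxx: yields 0 + 20
}
```

On their first invocation, the two helpers return 21 and 20 respectively, which is the behavioral difference the text attributes to instantiating used templates locally.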

Automatic Instantiation 

Automatic template instantiation relieves the user of 
the burden of determining which templates must be 
instantiated and where in the application those instanti- 
ations should take place. Automatic template instantia- 
tion can be divided into two categories: compile-time 
instantiation, whereby the decision about what should 
be instantiated is made at compile time, and link-time 
instantiation, whereby decisions about template instan- 
tiation are made when the user's application is linked. 
In both cases, specific link-time support is needed to 
select the required instantiations for the executable. 

Compile-time Instantiation Two major techniques 
can be used to perform automatic template instantia- 
tion at compile time. The choice between the two 
depends upon the facilities available in the linker. 
Microsoft Visual C++ instantiates templates at compile 
time using a strategy similar to the instantiate used 
templates command-line option described previously. 5 



Each instantiation is placed in the communal data sec- 
tion (COMDAT) of the current compilation's object 
file. Each object file contains a copy of every template
instantiation needed by that compilation unit. 
COMDATs are sections that have an attribute that tells 
the linker to accept, without issuing a warning, multi- 
ple definitions of a symbol defined in the section. 4 If 
more than one object file defines that symbol, only the 
section from one object file is linked into the image
and the rest are discarded, along with all symbols in 
the symbol table defined in the discarded section con- 
tribution. At link time, the linker resolves an instantia- 
tion reference by choosing one of the instantiations 
defined in an individual object file's COMDAT. The 
resulting user's application executable has a single 
copy of each requested instantiation. 
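
The selection step can be modeled with a short sketch. This is not a linker implementation: the ObjectFile structure, the function select_comdats, and the rule of keeping the first definition encountered are assumptions made for illustration. The text says only that the section from one object file is kept and the rest are discarded.

```cpp
#include <map>
#include <string>
#include <utility>
#include <vector>

// Simplified model of an object file: its name and the symbols it
// defines in COMDAT sections.
struct ObjectFile {
    std::string name;
    std::vector<std::string> comdat_symbols;
};

// Returns, for each COMDAT symbol, the object file whose section is kept.
// Assumed policy: the first definition seen wins; later duplicates are
// discarded silently instead of producing a multiple-definition error.
std::map<std::string, std::string>
select_comdats(const std::vector<ObjectFile> &objects) {
    std::map<std::string, std::string> chosen;
    for (const ObjectFile &obj : objects) {
        for (const std::string &sym : obj.comdat_symbols) {
            // insert() leaves an existing entry untouched
            chosen.insert(std::make_pair(sym, obj.name));
        }
    }
    return chosen;
}
```

With two object files that both define template_func<int>, the result maps that symbol to a single file, so the executable carries one copy.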

When such linker support is not available, another 
mechanism must be used to control compile-time 
instantiation. One such approach is to use a repository 
to contain the generated instantiations. The compiler 
creates the instantiations in the repository instead of 
the current compilation's object file. At link time, the
linker includes any requested instantiations from the 
repository. As a performance improvement, the com- 
piler can also decide whether an instantiation needs to 
be generated from the state of the repository. If the 
requested instantiation is in the repository and can be 
determined to be up to date, the compiler does not 
need to regenerate the instantiation. 
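
The up-to-date check might look like the following sketch. The Repository class, its member names, and the use of a numeric stamp to decide freshness are all assumptions; the text does not say how a repository entry is determined to be current.

```cpp
#include <map>
#include <string>

// Hypothetical in-memory model of a template repository. Each entry maps an
// instantiation's external name to a stamp describing the sources that
// produced it (assumed here to be a monotonically increasing number).
class Repository {
    std::map<std::string, long> stamps_;
public:
    // True when the compiler must generate the instantiation: it is either
    // missing from the repository or older than the current sources.
    bool needs_generation(const std::string &external_name,
                          long source_stamp) const {
        std::map<std::string, long>::const_iterator it =
            stamps_.find(external_name);
        return it == stamps_.end() || it->second < source_stamp;
    }

    // Called after the compiler writes the instantiation object file.
    void record(const std::string &external_name, long source_stamp) {
        stamps_[external_name] = source_stamp;
    }
};
```

A compiler following this scheme would call needs_generation before generating code and record afterward, so an unchanged template is compiled into the repository only once.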

Link-time Instantiation The decision to instantiate can 
be left until link time. The linker can find the instantiations
that are needed and direct the compiler to generate
those instantiations. McCluskey describes one link-time 
instantiation scheme. 5 " The compiler logs every class, 
union, struct, or enum in a name-mapping file in a repos- 
itory. Every declared template is also logged in the name- 



// template.hxx
#include <iostream.h>

template <class T> void template_func(T p)
{
    static T x = 0;
    cout << x + p;
    x++;
}

// A.cxx
#include "template.hxx"

extern void b_func();

int main()
{
    template_func(10);
    b_func();
    return 0;
}

// B.cxx
#include "template.hxx"

void b_func(void)
{
    //...
    template_func(20);
    //...
}



Figure 1 

Template Function Containing a Locally Defined Static Variable 
mapping file. At link time, a prelinker determines which
template instantiations are required. The prelinker builds 
temporary instantiation source files in the repository to 
satisfy the referenced instantiations, compiles them, and 
adds the resulting object files to the linker input.
Consider the example in Figure 2.

During the compilation of main.cxx, a name- 
mapping file is built in the repository and the location 
of the user-defined class C and the function template, 
perform_some_function, are recorded. From the infor- 
mation stored in the name-mapping file, an instantiation
source file is then created in the repository.
Figure 3 shows the contents of the instantiation source 
file created to satisfy perform_some_function<C>. 

The prelinker then compiles the instantiation source 
file by invoking the compiler in a special directed mode, 
which directs the compiler to generate code only for 
specific template instantiations that are listed on the 
command line. The compiler then generates the defin- 
ition of perform_some_function<C> in the resulting 
object file. The resulting object now satisfies the 
instantiation request and is included as part of the 
application's final link. To build the instantiation 
source files easily, the implementation of this scheme 
generally requires that template declarations, template 
definitions, and any argument types used to instantiate 
a class or function template must appear in separate, 
related header files. 

The Edison Design Group has developed another 
approach to link-time instantiation. 7 In this approach, 
the compiler records where template instantiations are 
used and where they can be instantiated. At link time, 
a prelinker assigns template instantiations by recording 
the assignments in a specially generated file that corre- 



/* perform_some_function(C&) */
#include "template.hxx"
#include "template.cxx"
#include "C_class.hxx"



Figure 3 

Example of an Instantiation Source File 



sponds to the particular source file that can success- 
fully instantiate the user's request. Compiling and prelinking
the program used in Figure 2 generates an
instantiation assignment file for main.cxx. This file 
contains information concerning the command-line 
options specified, the user's current working directory, 
and a list of instantiations that should be instantiated. 
Main.cxx now owns the responsibility of instantiating 
perform_some_function<C>. The prelinker recompiles 
the source files, such as main.cxx, that have changes in 
their template instantiation assignments. The process 
is repeated until there are no changes made to the 
instantiation assignments. Then the final link can be 
completed. 
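
The iterate-until-stable behavior of this scheme can be sketched as a small simulation. Everything below is a model built for illustration and not the Edison Design Group's implementation: the SourceFile structure, the References map, and the policy of assigning an instantiation to the first file able to produce it are assumptions. The model captures one point from the text, namely that recompiling a file with a new assignment can expose further instantiations, so the prelinker repeats until no assignment changes.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

struct SourceFile {
    std::string name;
    std::set<std::string> can_instantiate;  // definitions visible in this file
    std::set<std::string> assigned;         // contents of its assignment file
};

// references[x] lists the instantiations that x's generated code uses in turn.
typedef std::map<std::string, std::set<std::string> > References;

// Runs prelink passes until the assignments stop changing. Returns the number
// of passes that changed an assignment, each of which implies recompilation.
int prelink(std::vector<SourceFile> &files,
            std::set<std::string> needed,
            const References &references) {
    int passes_with_changes = 0;
    bool changed = true;
    while (changed) {
        changed = false;

        // Instantiations already produced by the last round of compilations.
        std::set<std::string> defined;
        for (const SourceFile &f : files)
            defined.insert(f.assigned.begin(), f.assigned.end());

        // Generated instantiations may reference further instantiations.
        for (const std::string &d : defined) {
            References::const_iterator it = references.find(d);
            if (it != references.end())
                needed.insert(it->second.begin(), it->second.end());
        }

        // Assign each unresolved instantiation to a file that can produce it.
        for (const std::string &n : needed) {
            if (defined.count(n)) continue;
            for (SourceFile &f : files) {
                if (f.can_instantiate.count(n)) {
                    f.assigned.insert(n);
                    changed = true;
                    break;
                }
            }
        }
        if (changed) ++passes_with_changes;
    }
    return passes_with_changes;
}
```

In this model an instantiation that no file can produce simply stays unresolved, which corresponds to an unresolved symbol at the final link.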

This approach has the advantage of requiring no 
special file structure to support automatic template 
instantiation. It is generally faster and simpler than 
McCluskey's approach, because fewer files are com- 
piled in the generation of the needed instantiations 
and the instantiations are generated in the context of 
the user's source code. In addition, the assignment of 
instantiations to source files can be preserved between 
recompilations of the source code, so that unless the 
structure of the application changes, the needed instanti- 
ations will be available without additional recompilation. 



//C_class.hxx
class C {
public:
    //...
};

//template.hxx
template <class T> void perform_some_function(T &param);

//template.cxx
template <class T> void perform_some_function(T &param) { }

//main.cxx
#include "C_class.hxx"
#include "template.hxx"

int main()
{
    C c;
    perform_some_function(c);
    return 0;
}



Figure 2 

Example of a Link-time Instantiation Scheme (McCluskey) 
Comparison of Manual and Automatic Instantiation
Techniques

The manual instantiation techniques require planning 
on the part of the user to ensure that needed instantia- 
tions are present, that no extraneous instantiations are 
generated, and that each needed instantiation appears 
exactly once within the application. With manual 
instantiation, the user has the advantage of gaining 
explicit control over all template instantiations. 
Although the strategy of instantiating used templates 
locally requires less planning, it does so at the cost of 
object file size and the restricted use of templates when 
static data members are present or when static data is 
defined locally within a function template instantiation. 

Automatic template instantiation provides template 
instantiation with no explicit action on the part of the 
user. Compile-time instantiation requires either spe- 
cific linker support to select a single template instanti- 
ation from potentially many candidates, or support by 
the compiler to generate instantiations in separate 
object files while compiling the user's source code. 
Relying on linker support allows the compiler to effi- 
ciently generate instantiations at the cost of larger 
object files; however, the user loses control over which
instantiation is used in the executable file. Although 
the use of separate instantiation object files usually 
takes more time at compilation than the linker-support 
method, it results in more compact object files and can 
provide the user with more control over which instan- 
tiation is used in the executable file. 

Link-time instantiation provides template instan- 
tiation that is tailored to the needs of the executable 
file. The primary cost is link-time performance, since 
generation of instantiations occurs at link time. 
Another disadvantage of link-time instantiation can be 
observed when building object-code libraries. Either 
the library must contain all the instantiations that it 
requires, or the user who wants to link with the library
must have access to all the machinery to create instan- 
tiations. Creating a library's instantiations involves 
extra steps during library construction. All the object 
files to be included in the library must be prelinked, 
so that the needed instantiations are generated. If 
instantiations are included in the individual object 
files in the library, as in the Edison Design Group 
approach, unintended modules may be linked from 
the library to provide the needed instantiations. 
Consider the following scenario, in which object 
files A and B are included in the library. Both files 
require the instantiation of perform_some_function<int>. 
When these files are prelinked, the instantiation of 
perform_some_function<int> is assigned to one of 
the files, say A. If an application that is being linked 
against the library requires that the object file B be 
linked into the executable, then the object file A is also 
linked. Here the instantiation needed by B was instantiated
in A even though the executable never referenced
anything explicitly defined in file A. This can
yield an unnecessarily large executable. 

In the next section, we review the template instan- 
tiation support in earlier versions of DIGITAL C++ 
and then discuss the rationale and design of the auto- 
matic template instantiation facility in version 6.0 of 
DIGITAL C++. 

DIGITAL C++ Template Instantiation Experience 

As the use of C++ templates has grown, DIGITAL 
C++ has been enhanced to support the need for 
improved instantiation techniques. The initial release 
of DIGITAL C++ occurred before the C++ standard- 
ization process had matured, so that the language supported
was based on The Annotated C++ Reference
Manual, referred to as the ARM. 8 The ARM defined 
template functionality, but it did not provide guidance 
for either manual or automatic template instantiation. 
Thus it was necessary to provide a DIGITAL C++- 
specific mechanism for template instantiation. 

DIGITAL C++ Manual Template Instantiation 

The #pragma define_template directive and the instan- 
tiate all command-line option, -define_templates, have 
been supported since the initial release of DIGITAL 
C++. 

In Figure 4, the define_template pragma directs the 
compiler to instantiate class template, C, with type int. 
When the compiler detects the use of the pragma, it 
creates an internal C<int> type node and traverses the 
list of static data members and member functions 
defined within the class. If the definitions of these 
members are present at the point the pragma is speci- 
fied, the compiler materializes each with type int. 

As the C++ language developed and template usage 
increased, users found manual template instantiation 
to be very labor intensive and requested an automated 
method. 

DIGITAL C++ Version 5.3 Automatic Template
Instantiation 

Automatic template instantiation capability became a 
serious issue during the planning stages of DIGITAL 
C++ version 5.3. The use of templates was increasing 
rapidly, and many new third-party libraries, such as 
Rogue Wave Software's Tools.h++, contained a significant
use of templates. Due to this growing need, the
requirements were straightforward. The support had 
to be easy to use, have a short design phase, be quickly 
implementable on both the DIGITAL UNIX and the 
OpenVMS platforms, and provide reasonable perfor- 
mance. Because McCluskey's approach had been used 
in several implementations, it presented itself as our 
best option. 
template <class T> class C {
public:
    void mem_func1(T p);
    void mem_func2(T p);
};

template <class T> void C<T>::mem_func1(T p) { /* ... */ }
template <class T> void C<T>::mem_func2(T p) { /* ... */ }

#pragma define_template C<int>



Figure 4 

The define_template Pragma 



DIGITAL made two major changes to McCluskey's 
approach to take advantage of the DIGITAL C++ 
compiler design. First, we allowed instantiation 
source files to be created at compile time instead of 
link time. This eliminated the need for McCluskey's 
name-mapping file and simplified the prelinking 
process considerably. Since the needed source files 
existed in the repository, there was no need to decon- 
struct the required template instantiations to deter- 
mine their arguments and types. 

The second change addressed the transitive closure 
problem. Figure 5 shows an example of the class tem- 
plate Buffer being instantiated with the user-defined type 
C. After compilation of app.cxx with the McCluskey 



approach, the name-mapping file contained definition 
locations of class B and class C. However, it did not con- 
tain any indication that class C had a data member that 
relied on the definition of class B. From the information 
in the name-mapping file, the prelinker then created an 
instantiation source file that included only C_class.hxx, 
Buffer.hxx, and Buffer.cxx. When this instantiation
source file was compiled, an error resulted complaining 
that B is an undefined type whose size is unknown. 

We solved this problem in DIGITAL C++ version 
5.3 by including all the top-level header files included 
by the current compilation unit in any instantiation 
source files created. This ensured that B_class.hxx
would be included in the generated instantiation file. 



//B_class.hxx
class B { /* ... */ };

//C_class.hxx
class C {
    B data_mem;
public:
    //...
};

//Buffer.hxx
template <class T> class Buffer {
    T *buffer;
    int num_of_items;
public:
    void add_item(T *);
    //...
};

//Buffer.cxx
template <class T>
void Buffer<T>::add_item(T *p) { }

//app.cxx
#include "B_class.hxx"
#include "C_class.hxx"
#include "Buffer.hxx"

void f(void)
{
    C c;
    Buffer<C> c_buffer;
    c_buffer.add_item(&c);
}



Figure 5 

Instantiation of the Class Template Buffer 
Despite the fact that this type of automatic link-
time instantiation scheme was being widely used 
in the industry, the results of using a modified 
McCluskey approach were mixed. Stroustrup has 
described the general problems with McCluskey's
approach.' We found that our implementation suf- 
fered particularly from poor link-time performance 
and so did not satisfy our users' needs. 

DIGITAL C++ Version 6.0 Automatic Template 
Instantiation 

DIGITAL C++ version 6.0 is a complete reimplemen- 
tation of DIGITAL C++, with emphasis on ANSI C++ 
conformance. It is implemented using a completely 
new code base, which includes the industry-standard 
C++ front end from the Edison Design Group and a 
standard class library from Rogue Wave. 

From our experience with template instantiation 
in DIGITAL C++ versions 5.3 through 5.6, we con- 
cluded that the most important issue that should 
be addressed in the design and implementation of 
the automatic template instantiation facility was the 
compile- and link-time performance. The primary 
goal was to have the performance of automatic tem- 
plate instantiation substantially exceed the perfor- 
mance of version 5.6. Another important goal was 
to remove the restriction of template declaration and 
definition placement in header files. In addition, the 
automatic template instantiation facility in version 6.0 
had to be culturally compatible with the previous 
implementation. The user had to be able to move 
sources and objects to different directories, easily 
build archived and shared libraries, share instantia- 
tions between various applications, and have error 
diagnostics reported at the earliest possible moment in 
the instantiation process. 

Design and Implementation We decided to use a 
compile-time instantiation model as the basis for our 
implementation. Since we were using the Edison 
Design Group's front end, we seriously considered 
using their link-time model. However, the compile- 
time model seemed advantageous for several reasons. 
First, there are significant complications (as described 
in the section Comparison of Manual and Automatic 
Instantiation Techniques) when trying to build 
libraries with a compiler that uses the Edison Design 
Group link-time model. In addition, the link-time 
model requires recompilations that limit performance 
in many typical cases of template use. We recognized 
that the link-time model could provide better perfor- 
mance in some cases, but these would be in the minor- 
ity. Finally, the implementation of the link-time model 
would require substantially more implementation 
effort on the OpenVMS platform. The version of the 
Edison Design Group front end being used to build 
DIGITAL C++ version 6.0 required tools to scan a 



user's object files for information concerning which 
modules could instantiate requested templates. Similar 
functionality would need to be implemented for the 
OpenVMS platform. 

We preserved the concept of the template reposi- 
tory as a directory that contains the individual tem- 
plate instantiation object files. The repository stores 
one object file for each template function, member 
function, static data member, and virtual table that is 
generated by automatic template instantiation. The
file name of the instantiation object file is derived from
the instantiation's external name. At compile
time, the front end generates intermediate code
for all templates that are needed in the compilation 
unit and can be instantiated. A tree walk is performed 
over the intermediate code to find all entities that are 
needed by each generated template instantiation. The 
code generator is called to generate code for the user- 
specified object file and is then called repeatedly for 
each template instantiation to generate the instantia- 
tion object files in the repository. 

The compiler generally considers an instantiation to 
be needed when it is referenced from a context that is 
itself needed, such as in a function with global visibility or 
by the initialization of a variable that is needed. Virtual 
member functions are needed when a constructor for 
the class is needed. Thus, all virtual function definitions
should be visible in a compilation unit that requires a 
constructor for the class. Each instantiation that is gener- 
ated with automatic instantiation is marked as potentially 
being in its own object file in the repository. 

The intermediate representation of each generated 
instantiation is walked to determine what other entities 
it references. At this point, the instantiation is a candi- 
date to be generated in its own object file, but it can 
sometimes be generated as part of the user-specified 
object file. If the instantiation references an entity that 
is local to the compilation unit, such as a static func- 
tion, and that local entity is nonconstant and statically 
initialized, the instantiation is merged into the user- 
specified object file rather than generated in its own 
object file. As an alternative, we could have chosen to 
change the local entity into a global entity with a 
unique name and generate the instantiation in its own 
object file. We chose not to do this in order to make it 
easier to share a repository between applications. With 
this alternative, the instantiation in the repository 
requires the object file containing the local entity's def- 
inition, which may be in another application. Note that 
any application that contains more than one definition 
of the same instantiation that references a nonconstant 
local entity is a non-standard-conforming application.
This is a violation of the one definition rule. 10 Consider
the following code fragment: 

static int j;

template <class T> int func (T arg) { return j; }
int var = func(2.5);
The reference to the static variable j in the template
function, func, prevents the template from being generated
into its own object file in the repository.
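
The placement rule can be sketched in a few lines. This is an illustration only, not the DIGITAL C++ implementation: the Entity and Placement types and the choose_placement function are hypothetical names, and the three flags simply restate the conditions given in the text.

```cpp
#include <vector>

// Hypothetical summary of one entity referenced by an instantiation.
struct Entity {
    bool local_to_unit;           // internal linkage, e.g. declared static
    bool constant;                // value cannot change after initialization
    bool statically_initialized;  // initialized before program start
};

enum class Placement { Repository, UserObjectFile };

// Decide where the code for one instantiation is emitted, given every
// entity found while walking its intermediate representation.
inline Placement choose_placement(const std::vector<Entity>& referenced) {
    for (const Entity& e : referenced) {
        if (e.local_to_unit && !e.constant && e.statically_initialized) {
            // A separate repository object could not name this entity,
            // so the instantiation stays with the user's object file.
            return Placement::UserObjectFile;
        }
    }
    return Placement::Repository;
}
```

In the fragment above, the static variable j would be described as local, nonconstant, and statically initialized, so func<double> would be placed in the user-specified object file.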

When the individual instantiations are walked, we 
mark each global entity that is defined in the compilation
unit so that the definition is replaced by an external
reference when the instantiation object file is
generated. Consider the following code fragment: 

void print_count (const char * s, int ivar)
{
    cout << s << ":" << ivar;
}

template <class T> void func (T arg)
{
    static int count = 0;
    print_count ("count", count++);
}

The function, print_count, is defined in the source 
file and generated as a defined function in the user-specified
object file. The template function, func, references
the function, print_count. When the code for
func is generated in its own object file, the reference to
print_count must be changed from a reference to a 
defined function to a reference to an external function. 

By default, each needed instantiation is generated by 
every compilation that requires the instantiation. This 
is the safe default because it ensures that instantiations 
in the repository are up to date. However, there will 
probably be some compilation overhead from regener- 
ating instantiations that may already be up to date. We 
believed that the overhead of regenerating instantiations
would typically be relatively small. For applications
with a high overhead of instantiation, such as a
large number of source files using the same large num- 
ber of template instantiations, we provided a compila- 
tion option to control the generation of template 
instantiations to improve compile-time performance. 

The generation of instantiation object files only 
when they are actually required is a difficult problem. 
Fine-grain dependency information would have to be 
kept for each instantiation object file. Such depen- 
dency information would need to reflect those files that 
are required to successfully generate the instantiation 
and record which command-line options the user speci- 
fied to the compiler. We suspected that the overhead 
involved with gathering and checking the information 
might be an appreciable percentage of the time it would 
take to do the instantiation, and thus it would not give 
us the performance improvement that we wanted. 

Instead, we decided to provide an option that allows 
the user to decide when instantiations are generated. 
We refer to this as the template time-stamp option, 
-ttimestamp. When using the time-stamp option, the 
compiler looks in the repository for a file named 
TIMESTAMP. If the file is not found, it is created. The
modification time of this file is referred to as the time
stamp. When generating an instantiation, the compiler
looks in the repository to see if the instantiation object
file exists. If it does not exist, it is generated. If the file 
already exists, its modification time is compared to the 
time stamp. If the modification time is later than the 
time stamp, the instantiation is assumed to be up to 
date and is not regenerated. Otherwise, the instantiation
is generated. The user can control the generation
of instantiation object files by changing the modification
time of the TIMESTAMP file.
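
The time-stamp check can be sketched as follows. This is a minimal illustration rather than the compiler's code: it assumes a C++17 standard library with std::filesystem (which postdates the compiler described here), and the function name must_generate and its parameters are hypothetical.

```cpp
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Returns true when the instantiation object file must be (re)generated
// under the time-stamp scheme. `repository` is the repository directory
// and `object_name` is the instantiation object file name (both assumed).
inline bool must_generate(const fs::path& repository,
                          const std::string& object_name) {
    const fs::path stamp = repository / "TIMESTAMP";
    if (!fs::exists(stamp)) {
        std::ofstream create(stamp);  // create an empty TIMESTAMP file
    }
    const fs::path object = repository / object_name;
    if (!fs::exists(object)) {
        return true;                  // never generated before
    }
    // Up to date only if the object is strictly newer than the time stamp.
    return !(fs::last_write_time(object) > fs::last_write_time(stamp));
}
```

Under this sketch, the first compilation of a build creates TIMESTAMP, and every instantiation object file written afterward is newer than the time stamp, so later compilations in the same build skip it.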

The time-stamp option would typically be used in 
a makefile or a shell script that compiles and builds 
an entire application. Before invoking make or the 
shell script, the user would make certain that no 
TIMESTAMP file resided in the repository. This 
would ensure that each needed instantiation would be 
generated exactly once during all the compilations 
done by the build procedure. 

Much of the C++ linker support in version 5.6 was 
reused with only minor modifications for version 
6.0. The compiler is presented with a single repository 
into which the instantiation object files are written. 
Multiple repositories can be specified at link time, and 
each can be searched for instantiations that are needed
by the executable file. The linker is used in a trial link 
mode to generate a list of all the unresolved external 
references. This list is then used to search the repositories
to find the needed instantiation files, and the
process is repeated until no more instantiations are 
needed or can be satisfied from the repository. The 
link then proceeds as any normal link, adding the list 
of instantiation object files to the list of object files 
and libraries as specified by the user. 
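
The repeated trial link can be modeled as a small fixed-point loop. The sketch below is an assumption-laden model, not the actual linker support: ObjectFile, Repository, unresolved, and collect_instantiations are hypothetical names, and symbol tables are reduced to sets of strings.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of an object file: symbols defined and referenced.
struct ObjectFile {
    std::set<std::string> defines;
    std::set<std::string> references;
};

// A repository maps an instantiation's external name to its object file.
using Repository = std::map<std::string, ObjectFile>;

// Stand-in for the trial link: every referenced symbol that no object
// in the link defines.
inline std::set<std::string> unresolved(const std::vector<ObjectFile>& link) {
    std::set<std::string> defined, missing;
    for (const ObjectFile& o : link)
        defined.insert(o.defines.begin(), o.defines.end());
    for (const ObjectFile& o : link)
        for (const std::string& r : o.references)
            if (!defined.count(r)) missing.insert(r);
    return missing;
}

// Repeat trial links, pulling instantiations from the repositories in
// order, until no unresolved reference can be satisfied. Returns the
// names of the instantiation object files added to the link.
inline std::vector<std::string> collect_instantiations(
        std::vector<ObjectFile> link,
        const std::vector<Repository>& repositories) {
    std::vector<std::string> added;
    std::set<std::string> taken;
    bool progress = true;
    while (progress) {
        progress = false;
        for (const std::string& name : unresolved(link)) {
            if (taken.count(name)) continue;
            for (const Repository& repo : repositories) {
                auto it = repo.find(name);
                if (it == repo.end()) continue;
                link.push_back(it->second);  // may add new references
                added.push_back(name);
                taken.insert(name);
                progress = true;
                break;                       // first repository wins
            }
        }
    }
    return added;
}
```

References that no repository can satisfy are simply left for the final link to report.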

If a vendor is creating a library rather than an exe- 
cutable file, the instantiations needed by the modules 
in the library can be provided in either of two ways: (1) 
The library vendor can put the needed instantiations 
in the library by adding the files in the repository to 
the library file. (2) The library vendor can provide the 
repository with the library and require that library 
users link with the repository as well. Note that instan- 
tiations placed in the library are fixed when the library 
is created. Since the library is included in the trial link 
of an application, any instantiation in the library takes 
precedence over the same named instantiation in a 
repository. 

Results In a number of tests, DIGITAL C++ version 
6.0 showed improved performance over version 5.6. 
We tested a variety of user code samples that use tem- 
plates to varying degrees and found that build times for 
version 6.0 decreased substantially compared to the 
version 5.6 compiler. Examples of two typical C++ 
applications used in our tests are the publicly available 
EON ray-tracing benchmark and a subset of tests from 
our Standard Template Library (STL) test suite. For 
the EON benchmark, the build time for version 6.0 was
reduced to 28 percent of the build time for version 5.6. 
For the STL tests, the build time for version 6.0 was 
reduced to 19 percent of the build time for version 5.6. 
The number of files in the repository also decreased 
significantly because version 6.0 generates only instan- 
tiation object files instead of the instantiation source, 
command, dependency, and object files of version 5.6. 
For EON, the version 6.0 repository contained 88 files 
compared to 260 files in version 5.6. 

Using the time-stamp option, build time for the 
EON benchmark was reduced by only 5 percent com- 
pared to the default instantiation strategy. The real 
benefit of the time-stamp option comes with applications
that use the same template instantiations in many
compilation units. For example, in one user's test case, 
build times dropped from roughly 18 hours with the 
default instantiation to 3 hours when using the time- 
stamp option. 

In the next section, we conclude our paper with a dis- 
cussion of further work that can improve the perfor- 
mance and usability of automatic template instantiation. 

Future Research 

We continue to investigate approaches and techniques 
to improve the usability and performance of the auto- 
matic template instantiation facility. Optimal usability 
and performance would seem to require a development 
environment completely integrated for C++. This envi- 
ronment would keep track of all entity definitions and 
usage and would be able to limit all instantiation gener- 
ation to the minimum needed. This approach would 
require a great deal of development work and might be 
difficult to integrate with existing customer develop- 
ment methodologies. Therefore, we focus on more 
modest techniques that approximate the optimal case. 

We are exploring ways to improve both performance 
and usability in the management of dependency infor- 
mation. We continue to look at approaches for using 
dependencies that can be reliable, automatic, and fast. 
We also continue to investigate ways to gather and check 
fine-grained dependency information for the instanti- 
ation object files, though performance is a concern. 
One approximation to the fine-grain dependency 
information that we are investigating is a larger grain 
dependency scheme. This technique creates a time 
stamp from the latest creation time of any source file
included during compilation of a given module. Any 
instantiation object file in the repository whose modi- 
fication time is later than this time stamp would not be 
regenerated. This approach is more automatic and can 
potentially yield better performance than our current 
time-stamp option, but it would not be sensitive to
changes on the command line or changes to the structure
of the files used to generate the instantiation. For
example, if the user specified an include directory
of old_include on the initial compilation and later
specified an include directory of new_include, this
approach would not recognize that different files were
being included.
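
The larger-grain scheme reduces to comparing one object file time against the newest included source file. The following sketch shows only that comparison; the function names are hypothetical, and file times are passed in as plain std::time_t values rather than read from disk.

```cpp
#include <algorithm>
#include <cassert>
#include <ctime>
#include <vector>

// Time stamp for a module: the latest time among all source files
// included while compiling it.
inline std::time_t module_time_stamp(
        const std::vector<std::time_t>& source_times) {
    assert(!source_times.empty());
    return *std::max_element(source_times.begin(), source_times.end());
}

// An existing instantiation object file is reused only if it was
// written after every included source file.
inline bool can_reuse(std::time_t object_mtime,
                      const std::vector<std::time_t>& source_times) {
    return object_mtime > module_time_stamp(source_times);
}
```

Because only times are compared, the check has no way to notice that a different set of files was included or that the command line changed, which is the limitation noted above.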

Another approach to improving application build 
performance is to support a build facility that can 
make use of template information in determining 
dependency. Currently, each user-specified object file 
is dependent on all the included files necessary to 
create instantiation object files for template requests. 
When a change is made to a template definition, all the 
sources that reference the template need to be recom- 
piled. A build facility designed to be sensitive to tem- 
plate instantiation could detect that a change in the 
template definition was limited to the instantiation 
object file. It could then instruct the compiler to sup- 
press the regeneration of object files for source files 
that are only being recompiled due to the change in 
the template instantiation. Such a facility could also 
suppress the recompilation of any source file that 
would only reproduce the changes to instantiations 
that were already regenerated. 

Because we recognize that link-time instantiation 
can perform better in some cases than the compile-time 
approach, we are investigating the link-time instantia- 
tion model as a user option. 

Finally, we continue to look at ways to reduce the 
cost of generating each instantiation. For example, by 
default the compiler compresses the generated object 
files. Although most instantiation object files are small, 
many of them are potentially generated in a single com- 
pilation. As a result, the time to compress all the instan- 
tiation object files can be significant. Improvements 
such as not compressing small object files and/or 
improving the algorithm of the object file compression 
implementation itself could yield significant performance
improvement. In addition to improvements
that would reduce the overhead of generating instantiations,
we are also researching ways to reduce the number
of instantiation object files. For example, we might
combine all the virtual functions of a class into a single 
instantiation object file in the repository. 

Summary 

As with most engineering problems, no single approach 
to the automatic instantiation of templates is optimal for 
all potential uses of templates. Based on our experience 
with providing template support in DIGITAL C++, we 
chose to implement a compile-time automatic template 
instantiation scheme for version 6.0 that generates 
instantiation object files into a repository. This choice 
allows users to better control when template instantiation
occurs. In addition, it provides a substantial
improvement in performance of template instantiation 
over version 5.6 and reduces the restrictions on the 
location of template declarations and definitions. We 
continue to investigate the template-instantiation imple- 
mentation to further improve compile- and link-time 
performance and ease of use. 

Measurement and Analysis of C and C++ Performance



As computer languages and architectures 
evolve, many more challenges are being pre- 
sented to compilers. Dealing with these issues 
in the context of the Alpha Architecture and the 
C and C++ languages has led Compaq's C and 
C++ compiler and engineering teams to develop 
a systematic approach to monitor and improve 
compiler performance at both run time and 
compile time. This approach takes into account 
five major aspects of product quality: function, 
reliability, performance, time to market, and 
cost. The measurement framework defines a 
controlled test environment, criteria for select- 
ing benchmarks, measurement frequency, and 
a method for discovering and prioritizing oppor- 
tunities for improvement. Three case studies 
demonstrate the methodology, the use of mea- 
surement and analysis tools, and the resulting 
performance improvements. 



Optimizing compilers are becoming ever more complex 
as languages, target architectures, and product features 
evolve. Languages contribute to compiler complexity 
with their increasing use of abstraction, modularity, 
delayed binding, polymorphism, and source reuse,
especially when these attributes are used in combina- 
tion. Modern processor architectures are evolving ever 
greater levels of internal parallelism in each successive 
generation of processor design. In addition, product 
feature demands such as support for fast threads and 
other forms of external parallelism, integration with 
smart debuggers, memory use analyzers, performance 
analyzers, smart editors, incremental builders, and feed- 
back systems continue to add complexity. At the same 
time, traditional compiler requirements such as stan- 
dards conformance, compatibility with previous ver- 
sions and competitors' products, good compile speed, 
and reliability have not diminished. 

All these issues arise in the engineering of Compaq's 
C and C++ compilers for the Alpha Architecture. 
Dealing with them requires a disciplined approach to 
performance measurement, analysis, and engineering of 
the compiler and libraries if consistent improvements in 
out-of-the-box and peak performance on Alpha proces- 
sors are to be achieved. In response, several engineering 
groups working on Alpha software have established 
procedures for feature support, performance measure- 
ment, analysis, and regression testing. 

The operating system groups measure and improve 
overall system performance by providing system-level 
tuning features and a variety of performance analysis 
tools. The Digital Products Division (DPD) Performance 
Analysis Group is responsible for providing official 
performance statistics for each new processor mea- 
sured against industry-standard benchmarks, such as 
SPECmarks published by the Standard Performance 
Evaluation Corporation and the TPC series of transac- 
tion processing benchmarks from the Transaction 
Processing Performance Council. The DPD Performance 
Analysis Group has established rigorous methods for 
analyzing these benchmarks and provides perfor- 
mance regression testing for new software versions. 
Similarly, the Alpha compiler back-end development
group (GEM) has established performance improvement
and regression testing procedures for SPECmarks;
it also performs extensive run-time performance analy- 
sis of new processors, in conjunction with refining and 
developing new optimization techniques. Finally, con- 
sultants working with independent software vendors 
(ISVs) help the ISVs port and tune their applications 
to work well on Alpha systems. 

Although the effort from these groups does con- 
tribute to competitive performance, especially on 
industry-standard benchmarks, the DEC C and C++ 
compiler engineering teams have found it necessary to 
independently monitor and improve both run-time 
and compile-time performance. In many cases, ISV 
support consultants have discovered that their applica- 
tions do not achieve the performance levels expected 
based on industry-standard benchmarks. We have seen 
a variety of causes: New language constructs and prod- 
uct features are slow to appear in industry bench- 
marks, thus these optimizations have not received 
sufficient attention. Obsolete or obsolescent source 
code remaining in the bulk of existing applications 
causes default options/switches to be selected that 
inhibit optimizations. Many of the most important 
optimizations used for exploiting internal parallelism 
make assumptions about code behavior that prove to 
be wrong. Bad experiences with compiler bugs induce 
users to avoid optimizations entirely. Configuration 
and source-code changes made just before a product is 
released can interfere with important optimizations. 

For all these reasons, we have used a systematic 
approach to monitor, improve, and trade off five 
major aspects of product quality in the DEC C and 
DIGITAL C++ compilers. These aspects are function, 
reliability, performance, time to market, and cost. 
Each aspect is chosen because it is important in isola- 
tion and because it trades off against each of the other 
aspects. The objective of this paper is to show how the 
one characteristic of performance can be improved 
while minimizing the impact on the other four aspects 
of product quality. 

In this paper, we do not discuss any individual opti- 
mization methods in detail; there is a plethora of liter- 
ature devoted to these topics, including a paper 
published in this Journal. Nor do we discuss specific
compiler product features needed for competitive sup- 
port on individual platforms. Instead, we show how 
the efforts to measure, monitor, and improve perfor- 
mance are organized to minimize cost and time to 
market while maximizing function and reliability. 
Since all these product aspects are managed in the con- 
text of a series of product releases rather than a single 
release, our goals are frequently expressed in terms of 
relationships between old and new product versions. 



For example, for the performance aspects, goals along 
the following lines are common: 

■ Optimizations should not impose a compile-speed 
penalty on programs for which they do not apply. 

■ The use of unrelated compiler features should not 
degrade optimizations. 

■ New optimizations should not degrade reliability. 

■ New optimizations should not degrade perfor- 
mance in any applications. 

■ Optimizations should not impose any nonlinear 
compile-speed penalty. 

■ No application should experience run-time speed 
regressions. 

■ Specific benchmarks or applications should achieve 
specific run-time speed improvements. 

■ The use of specific new language features should not 
introduce compile-speed or run-time regressions. 

In the context of performance, the term measure- 
ment usually refers to crude metrics collected during 
an automated script, such as compile time, run time, 
or memory usage. The term analysis, in contrast, 
refers to the process of breaking down the crude mea- 
surement into components and discovering how the 
measurement responds to changing conditions. For 
example, we analyze how compile speed responds to 
an increase in available physical memory. Often, a 
comprehensive analysis of a particular issue may 
require a large number of crude measurements. The 
goal is usually to identify a particular product feature 
or optimization algorithm that is failing to obey one of 
the product goals, such as those listed above, and 
repair it, replace it, or amend the goal as appropriate. 
As always, individual instances of this approach are 
interesting in themselves, but the goal is to maximize 
the overall performance while minimizing the impact on
development cost, new feature availability, reliability, and
time to market for the new version.

Although some literature 3 J discusses specific aspects 
of analyzing and improving performance of C and C++ 
compilers, a comprehensive discussion of the practical 
issues involved in the measurement and analysis of 
compiler performance has not been presented in the 
literature to our knowledge. In this paper, we provide a 
concrete background for a practitioner in the field of 
compilation-related performance analysis. 

In the next section, we describe the metrics associ- 
ated with the compiler's performance. Following that, 
we discuss an environment for obtaining stable perfor- 
mance results, including appropriate benchmarks, 
measurement frequency, and management of the results. 
Finally, we discuss the tools used for performance mea- 
surement and analysis and give examples of the use of 
those tools to solve real problems. 
Performance Metrics

In our experience, ISVs and end users are most inter- 
ested in the following performance metrics: 

■ Function. Although function is not usually consid- 
ered an aspect of performance, new language and 
product features are entirely appropriate to consider 
among potential performance improvements when 
trading off development resources. From the point 
of view of a user who needs a particular feature, the 
absence of that feature is indistinguishable from an 
unacceptably slow implementation of that feature. 

■ Reliability. Academic papers on performance sel- 
dom discuss reliability, but it is crucial. Not only is 
an unreliable optimization useless, often it preju- 
dices programmers against using any optimiza- 
tions, thus degrading rather than enhancing overall 
performance. 

■ Application absolute run time. Typically, the absolute 
run time of an application is measured for a bench- 
mark with specific input data. It is important to real- 
ize, however, that a user-supplied benchmark is often 
only a surrogate for the maximum application size. 

■ Maximum application size. Often, the end user is 
not trying to solve a specific input set in the shortest
time; instead, the user is trying to solve the largest
possible real-world problem within a specific time. 
Thus, trends (e.g., memory bandwidth) are often 
more important than absolute timings. This also 
implies that specific benchmarks must be retired or 
upgraded when processor improvements moot their 
original rationale. 

■ Price/Performance ratio. Often, the most effective 
competitor is not the one who can match our 
product's performance, but the one who can give 
acceptable performance (see above) with the cheapest 
solution. Since compiler developers do not contribute 
directly to server or workstation pricing decisions, 
they must use the previous metrics as surrogates.

■ Compile speed. This aspect is primarily of interest to
application developers rather than end users. 
Compile speed is often given secondary considera- 
tion in academic papers on optimization; however, it 
can make or break the decision of an ISV consider- 
ing a platform or a development environment. Also, 
for C++, there is an important distinction between 
ab initio build speed and incremental build speed, 
due to the need for template instantiation. 

■ Result file size. Both the object file and executable
file sizes are important. This aspect was not a particular
problem with C, but several language features
of C++ and its optimizations can lead to explosive
growth in result file size. The most obvious problems
are the need for extensive function inlining
and for instantiation of templates. In addition, for
debug versions of the result files, it is essential to 
find a way to suppress repeated descriptions of the 
type information for variables in multiple modules. 

■ Compiler dynamic memory use. Peak usage, aver- 
age usage, and pattern of usage must be regulated 
to keep the cost of a minimum development con- 
figuration low. In addition, it is important to ensure 
that specific compiler algorithms or combinations 
of them do not violate the usage assumptions built 
into the paging system, which can make the system 
unusable during large compilations. 

Crude measurements can be made for all or most of 
these metrics in a single script. When attempting to 
make a significant improvement in one or more met- 
rics, however, the change often necessarily degrades 
others. This is acceptable, as long as the only cases that 
pay a penalty (e.g., in larger dynamic memory use) are
the compilations that benefit from the improved run- 
time performance. 

As the list of performance metrics indicates, the most 
important distinction is made between compile-time 
and run-time metrics. In practice, we use automated 
scripts to measure compile-time and run-time perfor- 
mance on a fairly frequent (daily or weekly during 
development) basis. 

Compile-Time Performance Metrics 

To measure compile-time performance, we use four 
metrics: compilation time, size of the generated objects,
dynamic memory usage during compilation, and tem- 
plate instantiation time for C++. 

Compilation Time The compilation time is measured 
as the time it takes to compile a given set of sources,
typically excluding the link time. The link time is 
excluded so that only compiler performance is mea- 
sured. This metric is important because it directly 
affects the productivity of a developer. In the C++ case,
performance is measured ab initio, because our prod- 
uct set does not support incremental compilation 
below the granularity of a whole module. When opti- 
mization of the entire program is attempted, this may 
become a more interesting issue. The UNIX shell tim- 
ing tools make a distinction between user and system 
time, but this is not a meaningful distinction for a compiler
user. Since compilation is typically CPU intensive
and system time is usually modest, tracking the sum of
both the user and the system time gives the most realistic
result. Slow compilation times can be caused by the
use of O(n^2) algorithms in the optimization phases,
but they can also be frequently caused by excessive 
layering or modularity due to code reuse or excessive 
growth of the in-memory representation of the pro- 
gram during compilation (e.g., due to inlining). 
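
A measurement script can obtain the sum of user and system time for a compile step along the following lines. This is a sketch that assumes a POSIX system providing getrusage; the helper names are hypothetical, and the command string is supplied by the caller (for example, a compile-only invocation so that link time is excluded).

```cpp
#include <cstdlib>
#include <sys/resource.h>
#include <sys/time.h>

// Total CPU seconds (user + system) consumed so far by child processes
// that this process has waited for.
inline double children_cpu_seconds() {
    struct rusage usage{};
    getrusage(RUSAGE_CHILDREN, &usage);
    auto seconds = [](const timeval& t) {
        return static_cast<double>(t.tv_sec) + t.tv_usec / 1e6;
    };
    return seconds(usage.ru_utime) + seconds(usage.ru_stime);
}

// Run one command and return the user + system CPU time it consumed.
inline double cpu_seconds_for(const char* command) {
    const double before = children_cpu_seconds();
    const int status = std::system(command);  // waits for the child
    (void)status;                             // exit status not examined here
    return children_cpu_seconds() - before;
}
```

Tracking the single summed value keeps the daily or weekly results comparable even when work shifts between user and system time.
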
Size of Generated Objects Excessive size of generated
objects is a direct contributor to slow compile and 
link times. In addition to the obvious issues of inlin- 
ing and template instantiation, duplication of the type 
and naming information in the symbolic debugging 
support has been a particular problem with C++. 
Compression is possible and helps with disk space, but 
this increases link time and memory use even more. 
The current solution is to eliminate duplicate informa- 
tion present in multiple modules of an application. 
This work requires significant support in both the 
linker and the debugger. As a result, the implementa- 
tion has been difficult. 

Dynamic Memory Usage during Compilation Usually 
modern compilers have a multiphase design whereby 
the program is represented in several different forms in 
dynamic memory during the compilation process. For 
C and C++ optimized compilations, this involves at 
least the following processes: 

■ Retrieving the entire source code for a module 
from its various headers 

■ Preprocessing the source according to the C/C++ 
rules 

■ Parsing the source code and representing it in an 
abstract form with semantic information embedded 

■ For C++, expanding template classes and functions 
into their individual instances

■ Simplifying high-level language constructs into a 
form acceptable to the optimization phases 

■ Converting the abstract representation to a differ- 
ent abstract form acceptable to an optimizer, usu- 
ally called an intermediate language (IL) 

■ Expanding some low-level functions inline into the 
context of their callers 

■ Performing multiple optimization passes involving 
annotation and transformation of the IL 

■ Converting the IL to a form symbolically representing
the target machine language, usually called code
generation 

■ Performing scheduling and other optimizations on 
the symbolic machine language 

■ Converting the symbolic machine language to actual 
object code and writing it onto disk 

In modern C and C++ compilers, these various intermediate
forms are kept entirely in dynamic memory.
Although some of these operations can be performed 
on a function-by-function basis within a module, it is 
sometimes necessary for at least one intermediate form 
of the module to reside in dynamic memory in its 
entirety. In some instances, it is necessary to keep mul- 
tiple forms of the whole module simultaneously. 



This presents a difficult design challenge: how do we 
compile large programs using an acceptable amount of 
virtual and physical memory? Trade-offs change con- 
stantly as memory prices decline and paging algorithms 
of operating systems change. Some optimizations even 
have the potential to expand one of the intermediate 
representations into a form that grows faster than the 
size of the program (O(n log n), or even O(n^3)). In
these cases, optimization designers often limit the 
scope of the transformation to a subset of an individual 
function (e.g., a loop nest) or use some other means to 
artificially limit the dynamic memory and computation 
requirements. To allow additional headroom, upstream 
compiler phases are designed to eliminate unnecessary 
portions of the module as early as possible. 

In addition, the memory management systems are 
designed to allow internal memory reuse as effi- 
ciently as possible. For this reason, compiler design- 
ers at Compaq have generally preferred a zone-based 
memory management approach rather than either a 
malloc-based or a garbage-collection approach. A 
zoned memory approach typically allows allocation 
of varying amounts of memory into one of a set of 
identified zones, followed by deallocation of the 
entire zone when all the individual allocations are no 
longer needed. Since the source program is repre- 
sented by a succession of internal representations 
in an optimizing compiler, a zone-based memory
management system is very appropriate. 
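The zone idea described above can be illustrated with a minimal sketch in Python. This is not the memory manager used in the compilers discussed here; the class name `Zone`, the zone names, and the sizes are all hypothetical. It only shows the two operations the text names: allocating varying amounts of memory into an identified zone, and deallocating the entire zone at once.

```python
class Zone:
    """A named region; individual allocations are never freed one by one."""

    def __init__(self, name, capacity):
        self.name = name
        self.buffer = bytearray(capacity)  # backing store for the whole zone
        self.top = 0                       # bump pointer: next free offset

    def alloc(self, size):
        """Reserve `size` bytes and return a writable view of them."""
        if size < 0 or self.top + size > len(self.buffer):
            raise MemoryError(f"zone {self.name!r} exhausted")
        start = self.top
        self.top += size
        return memoryview(self.buffer)[start:self.top]

    def release(self):
        """Deallocate every allocation in the zone in one step."""
        self.top = 0


# One zone per internal representation (hypothetical names and sizes).
ast_zone = Zone("ast", 1024)
il_zone = Zone("il", 1024)

node = ast_zone.alloc(64)   # allocations of varying size go into a zone
node[0] = 0x2A
il_zone.alloc(128)          # the next representation gets its own zone
ast_zone.release()          # the earlier form is no longer needed
```

Because release is a single pointer reset, its cost does not depend on how many allocations were made, which is what makes the approach a natural fit for a succession of short-lived internal representations.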

The main goals of the design are to keep the peak 
memory use below any artificial limits on the virtual 
memory available for all the actual source modules 
that users care about, and to avoid algorithms that 
access memory in a way that causes excessive cache 
misses or page faults. 

Template Instantiation Time for C++ Templates are a 
major new feature of the C++ language and are heavily 
used in the new Standard Library. Instantiation of 
templates can dominate the compile time of the mod- 
ules that use them. For this reason, template instantia-
tion is undergoing active study and improvement,
both when compiling a module for the first time and 
when recompiling in response to a source change. An 
improved technique, now widely adopted, retains pre- 
compiled instantiations in a library to be used across 
compilations of multiple modules. 

Template instantiation may be done at either com-
pile time or link time, or some combination. 5
DIGITAL C++ has recently changed from a link-time
to a compile-time model for improved instantiation 
performance. The instantiation time is generally pro- 
portional to the number of templates instantiated, 
which is based on a command-line switch specification 
and the time required to instantiate a typical template. 

Run Time Performance Metrics

We use automated scripts to measure run-time perfor-
mance for generated code, the debug image size, the pro-
duction image size, and specific optimizations triggered.

Run Time for Generated Code The run time for gen- 
erated code is measured as the sum of user and system 
time on UNIX required to run an executable image. 
This is the primary metric for the quality of generated 
code. Code correctness is also validated. Comparing 
run times for slightly differing versions of synthetic 
benchmarks allows us to test support for specific opti- 
mizations. Performance regression testing on both 
synthetic benchmarks and user applications, however, 
is the most cost-effective method of preventing per- 
formance degradations. Tracing a performance regres- 
sion to a specific compiler change is often difficult, but 
the earlier a regression is detected, the easier and 
cheaper it is to correct. 

Debug Image Size The size of an image compiled 
with the debug option selected during compilation is 
measured in bytes. It is a constant struggle to avoid 
bloat caused by unnecessary or redundant information 
required for symbolic debugging support. 

Production Image Size The size of a production 
(optimized, with no debug information) application 
image is measured in bytes. The use of optimization 
techniques has historically made this size smaller, but 
modern RISC processors such as the Alpha micro- 
processor require optimizations that can increase code 
size substantially and can lead to excessive image sizes 
if the techniques are used indiscriminately. Heuristics 
used in the optimization algorithms limit this size 
impact; however, subtle changes in one part of the
optimizer can trigger unexpected size increases that
affect I-cache performance.

Specific Optimizations Triggered In a multiphase 
optimizing compiler, a specific optimization usually 
requires preparatory contributions from several 
upstream phases and cleanup from several down- 
stream phases, in addition to the actual transforma- 
tion. In this environment, an unrelated change in one 
of the upstream or downstream phases may interfere 
with a data structure or violate an assumption 
exploited by a downstream phase and thus generate 
bad code or suppress the optimizations. The genera- 
tion of bad code can be detected quickly with auto- 
mated testing, but optimization regressions are much 
harder to find. 

For some optimizations, however, it is possible to 
write test programs that are clearly representative 
and can show, either by some kind of dumping or 
by comparative performance tests, when an imple-
mented optimization fails to work as expected. One
commercially available test suite is called NULLSTONE, 6
and custom-written tests are used as well.

In a collection of such tests, the total number of opti- 
mizations implemented as a percentage of the total 
tests can provide a useful metric. This metric can indi- 
cate if successive compiler versions have improved and 
can help in comparing optimizations implemented in 
compilers from different vendors. The optimizations 
that are indicated as not implemented provide useful 
data for guiding future development effort. 
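As an illustration of this metric, the following Python sketch computes the percentage of tests in which the expected optimization was observed and lists the ones that were not. The test names and the pass/fail values are invented for the example; they are not NULLSTONE results.

```python
def coverage_percent(results):
    """results maps a test name to True if the optimization was performed."""
    if not results:
        raise ValueError("no test results")
    implemented = sum(1 for ok in results.values() if ok)
    return 100.0 * implemented / len(results)


def not_implemented(results):
    """Tests whose optimization was not observed, as input for planning."""
    return sorted(name for name, ok in results.items() if not ok)


# Hypothetical results for two successive compiler versions.
old = {"const_fold": True, "loop_unroll": False,
       "inline_small": False, "cse": True}
new = {"const_fold": True, "loop_unroll": True,
       "inline_small": False, "cse": True}

print(coverage_percent(old), coverage_percent(new))
print(not_implemented(new))
```

Comparing the two percentages shows whether the newer version has improved, and the list of missing optimizations is the kind of data that can guide future development.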

The application developer must always consider the 
compile-time versus run-time trade-off. In a well- 
designed optimizing compiler, longer compile times 
are exchanged for shorter run times. This relationship,
however, is far from linear and depends on the impor- 
tance of performance to the application and the phase 
of development. 

During the initial code-development stage, a shorter 
compile time is useful because the code is compiled 
often. During the production stage, a shorter run time 
is more important because the code is run often. 
Although most of the above metrics can be directly 
measured, dynamic memory use can only be indirectly 
observed, for example, from the peak stack use and the 
peak heap use. As a result, our tests include bench- 
marks that potentially make heavy use of dynamic 
memory. Any degradation in a newer compiler version 
can be deduced from observing the compilation of 
such test cases. 

Environment for Performance Measurement 

In this section, we describe our testing environment, 
including hardware and software requirements, crite- 
ria for selecting benchmarks, frequency of perfor- 
mance measurement, and tracking the results of our 
performance measurements. 

Compiler performance analysis and measurement 
give the most reliable and consistent results in a 
controlled environment. A number of factors other 
than the compiler performance have the potential of 
affecting the observed results, and the effect of such 
perturbations must be minimized. The hardware and 
software components of the test environment used are
discussed below. 

Experience has shown that it helps to have a dedi- 
cated machine for performance analysis and measure- 
ment, because the results obtained on the same 
machine tend to be consistent and can be meaning- 
fully compared with successive runs. In addition, the 
external influences can be closely controlled, and ver- 
sions of system software, compilers, and benchmarks 
can be controlled without impacting other users. 

Several aspects of the hardware configuration on the 
test machine can affect the resulting measurements. 
Even within a single family of CPU architectures at 
comparable clock speeds, differences in specific imple-
mentations can cause significant performance changes.
The number of levels and the sizes of the on-chip and 
board-level caches can have a strong effect on perfor-
mance in a way that depends on algorithms of the
application and the size of the input data set. The size 
and the access speed of the main memory strongly 
affect performance, especially when the application 
code or data does not fit into the cache. The activity on 
a network connected to the test system can have an 
effect on performance; for example, if the test sources 
and the executable image are located on a remote disk 
and are fetched over a network. Variations in the 
observed performance may be divided into two parts: 
(1) system-to-system variations in measurement when
running the same benchmark and (2) run-to-run varia- 
tion on the same system running the same benchmark. 

Variation due to hardware resource differences 
between systems is addressed by using a dedicated 
machine for performance measurement as indicated 
above. Variation due to network activity can be mini- 
mized by closing all the applications that make use of 
the network before the performance tests are started 
and by using a disk system local to the machine under 
test. The variations due to cache and main memory 
system effects can be kept consistent between runs by 
using similar setups for successive runs of performance 
measurement. 
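Run-to-run variation on a single system can be quantified before any two sets of results are compared. The Python sketch below is one possible way to do this, using the standard `statistics` module; the 2 percent limit on the coefficient of variation and the sample timings are assumptions made for the example, not values taken from our measurements.

```python
import statistics


def summarize(timings):
    """Mean, sample standard deviation, and coefficient of variation."""
    mean = statistics.mean(timings)
    stdev = statistics.stdev(timings)  # requires at least two timings
    return {"mean": mean, "stdev": stdev, "cv": stdev / mean}


def usable(timings, max_cv=0.02):
    """Accept a set of runs only if its relative spread is small enough.

    max_cv is a hypothetical threshold chosen for illustration.
    """
    return summarize(timings)["cv"] <= max_cv


# Hypothetical timings in seconds for repeated runs of one benchmark.
quiet_system = [10.0, 10.1, 9.9, 10.0]
busy_system = [10.0, 12.0, 9.0, 11.0]

print(usable(quiet_system), usable(busy_system))
```

A set of runs rejected by such a check would be repeated after the source of the perturbation (for example, network activity) has been removed.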

In addition to the hardware components of the 
setup described above, several aspects of the software 
environment can affect performance. The operating 
system version used on the test machine should corre- 
spond to the version that the users are likely to use on 
their machines, so that the users see comparable per- 
formance. The libraries used with the compiler are 
usually shipped with the operating system. Using dif- 
ferent libraries can affect performance because newer 
libraries may have better optimizations or new fea- 
tures. The compiler switches used while compiling test 
sources can result in different optimization trade-offs. 
Due to the large number of compiler options sup- 
ported on a modern compiler, it is impractical to test 
performance with all possible combinations. 

To meet our requirements, we used the following 
small set of switch combinations: 

1. Default Mode. The default mode represents the
default combination of switches selected for the com-
piler when no user-selectable options are specified.
The compiler designer chooses the default combina- 
tion to provide a reasonable trade-off between com- 
pile speed and run speed. The use of this mode is very 
common, especially by novices, and thus is important 
to measure. 

2. Debug Mode. In the debug mode, we test the option
combination that the programmer would select when
debugging. Optimizations are typically turned off,
and full symbolic information is generated about the
types and addresses of program variables. This mode
is commonly specified during code development.

3. Optimize/Production Mode. In the optimize/ 
production mode, we select the option combina-
tion for generating optimized code (-O compiler
option) for a production image. This mode is most 
likely to be used in compiling applications before 
shipping to customers. 

We prefer to measure compile speed for debug mode, 
run speed for production mode, and both speeds for 
the default mode. The default mode is expected to lose 
only modest run speed over optimize mode, have good 
compile speed, and provide usable debug information. 

Criteria for Selecting Benchmarks 

Specific benchmarks are selected for measuring perfor- 
mance based on the ease of measuring interesting 
properties and the relevance to the user community. 
The desirable characteristics of useful benchmarks are 

■ It should be possible to measure individual opti- 
mizations implemented in the compiler. 

■ It should be possible to test performance for com- 
monly used language features. 

■ At least some of the benchmarks should be repre- 
sentative of widely used applications. 

■ The benchmarks should provide consistent results, 
and the correctness of a run should be verifiable. 

■ The benchmarks should be scalable to newer 
machines. As newer and faster machines are devel- 
oped, the benchmark execution times diminish. It 
should be possible to scale the benchmarks on the 
machines, so that useful results can still be obtained 
without significant error in measurement. 

To meet these diverse requirements, we selected a set 
of benchmarks, each of which meets some of the 
requirements. We grouped our benchmarks in accor- 
dance with the performance metrics, that is, as compile- 
time and run-time benchmarks. This distinction is 
necessary because it allows us to fine-tune the contents
of the benchmarks under each category. The compile-
time and run-time benchmarks may be further classified
as (1) synthetic benchmarks for testing the performance
of specific features or (2) real applications that indicate 
typical performance and combine the specific features. 

Compile-Time Benchmarks Examples of synthetic 
compile-time benchmarks include the #define inten- 
sive preprocessing test, the array intensive test, the 
comment intensive test, the declaration processing 
intensive test, the hierarchical #include intensive test, 
the printf intensive test, the empty #include intensive 
test, the arithmetic intensive test, the function defini- 
tion intensive test (needs a large memory), and the 
instantiation intensive test. 

Real applications used as compile-time bench-
marks include selected sources from the C compiler,
the DIGITAL UNIX operating system, UNIX utilities 
such as awk, the X window interface, and C++ class 
inheritance. 

Run-Time Benchmarks Synthetic run-time bench- 
marks contain tests for individual optimizations for 
different data types, storage types, and operators. One
run-time suite called NULLSTONE 6 contains tests for
C and C++ compiler optimizations; another test suite 
called Bench++ 7 has tests for C++ features such as vir- 
tual function calls, exception handling, and abstraction 
penalty (the Haney kernels test, the Stepanov bench- 
mark, and the OOPACK benchmark"). 

Run-time benchmarks of real applications for the C 
language include some of the SPEC tests that are closely 
tracked by the DPD Performance Group. For C++, the 
tests consist of the groff word processor processing a set 
of documents, the EON ray tracing benchmark, the 
Odbsim, a database simulator from the University of
Colorado, and tests that call functions from a search 
class library. 

Acquiring and Maintaining Benchmarks 

We have established methods of acquiring, maintain- 
ing, and updating benchmarks. Once the desirable 
characteristics of the benchmarks have been identified, 
useful benchmarks may be obtained from several 
sources, notably a standards organization such as 
SPEC or a vendor such as Nullstone Corporation. The
public domain can provide benchmarks such as EON, 
groff, and Bench++. The use of a public-domain 
benchmark may require some level of porting to make 
the benchmark usable on the test platform if the origi- 
nal application was developed for use with a different 
language dialect, e.g., GNU's gcc.

Sometimes, customers encounter performance prob- 
lems with a specific feature usage pattern not anticipated 
by the compiler developers. Customers can provide 
extracts of code that a vendor can use to reproduce 
these performance problems. These code extracts can 
form good benchmarks for use in future testing to avoid 
reoccurrence of the problem. 

Application code such as extracts from the compiler 
sources can be acquired from within the organization. 
Code may also be obtained from other software devel-
opment groups, e.g., the class library group, the
debugger group, and the operating system group. 

If none of these sources can yield a benchmark with 
a desirable characteristic, then one may be written 
solely to test the specific feature or combination. 

In our tests of the DIGITAL C++ compiler, we 
needed to use all the sources discussed above to obtain 
C++ benchmarks that test the major features of the 
language. The public-domain benchmarks sometimes 
required a significant porting effort because of com-
patibility issues between different C++ dialects. We
also reviewed the results published by other C++ com- 
piler vendors. 

Maintaining a good set of performance measurement 
benchmarks is necessary for evolving languages such as 
C and C++. New standards are being developed for 
these languages, and standards compatibility may make 
some of a benchmark's features obsolete. Updating the 
database of benchmarks used in testing involves 

■ Changing the source of existing benchmarks to 
accommodate system header and default behavior 
changes 

■ Adding new benchmarks to the set when new com- 
piler features and optimizations are implemented 

■ Deleting outdated benchmarks that do not scale 
well to newer machines 

In the following subsection, we discuss the fre- 
quency of our performance measurement. 

Measurement Frequency 

When deciding how often to measure compiler per- 
formance, we consider two major factors: 

■ It is costly to track down a specific performance 
regression amid a large number of changes. In fact, 
it sometimes becomes more economical to address 
a new opportunity instead. 

■ In spite of automation, it is still costly to run a suite 
of performance tests. In addition to the actual run 
time and the evaluation time, and even with signifi- 
cant efforts to filter out noise, the normal run-to- 
run variability can show phantom regressions or 
improvements. 

These considerations naturally lead to two obvious 
approaches to test frequency: 

■ Measuring at regular intervals. During active devel- 
opment, measuring at regular intervals is the most 
appropriate policy. It allows pinpointing specific 
performance regressions most cheaply and permits 
easy scheduling and cost management. The interval 
selected depends on the amount of development 
(number of developers and frequency of new code 
check-ins) and the cost of the testing. In our tests, 
the intervals have been as frequent as three days and 
as infrequent as 30 days. 

■ Measuring on demand. Measurement is performed
on demand when significant changes occur, for 
example, the delivery of a major new version of a 
component or a new version of the operating system. 
A full performance test is warranted to establish a 
new baseline when a competitor's product is released 
or to ensure that a problem has been corrected.

Both strategies, if implemented purely, have problems. 
Frequent measurement can catch problems early but is 
resource intensive, whereas an on-demand strategy
may not catch problems early enough and may not 
allow sufficient time to address discovered problems. 
In retrospect, we discovered that the time devoted to 
more frequent runs of existing tests could be better 
used to develop new tests or analyze known results 
more fully. 

We concluded that a combination strategy is the best 
approach. In our case, all the performance tests are run
prior to product releases and after major component 
deliveries. Periodic testing is done during active devel- 
opment periods. The measurements can be used for 
analyzing existing problems, analyzing and comparing 
performance with a competing product, and finding 
new opportunities for performance improvement. 

Managing Performance Measurement Results 

Typically, the first time a new test or analysis method is 
used, a few obvious improvement opportunities are 
revealed that can be cheaply addressed. Long-term 
improvement, however, can only be achieved by going 
beyond this initial success and addressing the remain- 
ing issues, which are either costly to implement or 
which occur infrequently enough to make the effort 
seem unworthy. This effort involves systematically 
tracking the performance issues uncovered by the 
analysis and judging the trends to decide which 
improvement efforts are most worthwhile. 

Our experience shows that rigorously tracking all 
the performance issues resulting from the analyses 
provides a long list of opportunities for improvement, 
far more than can be addressed during the develop- 
ment of a single release. It thus became obvious that, 
to deploy our development resources most effectively, 
we needed to devise a good prioritization scheme. 

For each performance opportunity on our list, we 
keep crude estimates of three criteria: usage frequency, 
payoff from implementation, and difficulty of imple- 
mentation. We then use the three criteria to divide the 
space of performance issues into equivalence classes. 
We define our criteria and estimates as follows: 

■ Usage frequency. The usage frequency is said to be 
common if the language feature or code pattern 
appears in a large fraction of source modules or 
uncommon if it appears in only a few modules. 
When the language feature or code pattern appears
predominantly in the modules of a particular application
domain, the usage frequency is said to be
skewed. The classic example of skewed usage is the 
complex data type. 

■ Payoff from implementation. Improvement in an 
implementation is estimated as high, moderate, or 
small. A high improvement would be the elimina- 
tion of the language construct (e.g., removal of 
unnecessary constructors in C++) or a significant 
fraction of its overhead (e.g., inlining small func-
tions). A moderate improvement would be a 10 to
50 percent increase in the speed of a language fea- 
ture. A small improvement such as loop unrolling 
is worthwhile because it is common. 

■ Difficulty of implementation. We estimate the 
resource cost for implementing the suggested 
optimization as difficult, straightforward, or easy. 
Items are classified based on the complexity of 
design issues, total code required, level of risk, or 
number and size of testing requirements. An easy 
improvement requires little up-front design and 
no new programmer or user interfaces, introduces 
little breakage risk for existing code, and is typically 
limited to a single compiler phase, even if it involves
a substantial amount of new code. A straightfor- 
ward improvement would typically require a sub- 
stantial design component with multiple options 
and a substantial amount of new coding and testing 
but would introduce little risk. A difficult improve- 
ment would be one that introduces substantial risk 
regardless of the design chosen, involves a new user 
interface, or requires substantial new coordination 
between components provided by different groups. 

For each candidate improvement on our list, we 
assign a triple representing its priority, which is a 
Cartesian product of the three components above: 

Priority = (frequency) x (payoff) x (difficulty) 

This classification scheme, though crude and subjec- 
tive, provides a useful base for resource allocation. 
Opportunities classified as common, high, and easy are 
likely to provide the best resource use, whereas those 
issues classified as uncommon, small, and difficult are 
the least attractive. This scheme also allows manage- 
ment to prioritize performance opportunities against 
functional improvements when allocating resources 
and schedule for a product release. 
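The triple can be represented directly as a tuple of three ordered labels. The Python sketch below is an illustration only: the opportunity names are invented, and both the position of "skewed" between "uncommon" and "common" and the lexicographic ordering of the triples are assumptions. The scheme described above fixes only the two extremes (common, high, easy is the most attractive; uncommon, small, difficult is the least).

```python
from enum import IntEnum


class Frequency(IntEnum):
    UNCOMMON = 0
    SKEWED = 1      # placement between the other two is an assumption
    COMMON = 2


class Payoff(IntEnum):
    SMALL = 0
    MODERATE = 1
    HIGH = 2


class Difficulty(IntEnum):
    DIFFICULT = 0
    STRAIGHTFORWARD = 1
    EASY = 2        # easier work ranks higher


def rank(opportunities):
    """Order (name, frequency, payoff, difficulty) tuples, best first.

    Lexicographic comparison of the triple is an assumed tie-breaking
    rule; only the best and worst classes are fixed by the scheme.
    """
    return sorted(opportunities, key=lambda o: o[1:], reverse=True)


# Hypothetical entries on a list of performance opportunities.
opportunities = [
    ("fold macro constants", Frequency.UNCOMMON, Payoff.HIGH, Difficulty.EASY),
    ("inline small functions", Frequency.COMMON, Payoff.HIGH, Difficulty.EASY),
    ("rare cast cleanup", Frequency.UNCOMMON, Payoff.SMALL, Difficulty.DIFFICULT),
    ("complex arithmetic", Frequency.SKEWED, Payoff.MODERATE,
     Difficulty.STRAIGHTFORWARD),
]

for name, freq, payoff, diff in rank(opportunities):
    print(f"{name}: ({freq.name}, {payoff.name}, {diff.name})")
```

Keeping the three components separate, rather than collapsing them into one number, preserves the equivalence classes and leaves room for the further judgment discussed next.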

Further classification requires more judgment and 
consideration of external forces such as usage trends, 
hardware design trends, resource availability, and 
expertise in a given code base. Issues classified as com- 
mon and high but difficult are appropriate for a major 
achievement of a given release, whereas an opportu- 
nity that is uncommon and moderate but easy might 
be an appropriate task for a novice compiler developer.

So-called "nonsense optimizations" are often con- 
troversial. These are opportunities that are almost 
nonexistent in human-written source code, for exam- 
ple, extensive operations on constants. Ordinarily they 
would be considered unattractive candidates; how- 
ever, they can appear in hidden forms such as the result 
of macro expansion or as the result of optimizations 
performed by earlier phases. In addition, they often 
have high per-use payoff and are easy to implement, so 
it is usually worthwhile to implement new nonsense 
optimizations when they are discovered. 

Management control and resource allocation issues
can arise when common, high, or easy opportunities 
involve software owned by groups not under the 
direct control of the compiler developers, such as 
headers or libraries. 

Tools and Methodology 

We begin this section with a discussion of performance 
evaluation tools and their application to problems. We 
then briefly present the results of three case studies. 

Tools and Their Application to Problems 

Tools for performance evaluation are used for either 
measurement or analysis. Tools for measurement are 
designed mainly for accurate, absolute timing. Low 
overhead, reproducibility, and stability are more 
important than high resolution. Measurement tools 
are primarily used in regression testing to identify the 
existence of new performance problems. Tools for 
analysis, on the other hand, are used to isolate the 
source code responsible for the problem. High, rela- 
tive accuracy is more important than low overhead or 
stability here. Analysis tools tend to be intrusive: they 
add instrumentation to either the sources or the exe- 
cutable image in some manner, so that enough infor- 
mation about the execution can be captured to 
provide a detailed profile. 

We have constructed adequate automated measure- 
ment tools using scripts layered over standard operating 
system timing packages. For compile-time measure- 
ment, a driver reads the compile commands from a file 
and, after compiling the source the specified number 
of times, writes the resulting timings to a file. Post- 
processing scripts evaluate the usability of the results 
(average times, deviations, and file sizes) and compare 
the new results against a set of reference results. For 
compile-time measurement, the default, debug, and 
optimize compilation modes are all tested, as previ- 
ously discussed. 
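A driver of this kind can be sketched in a few lines of Python. The sketch below is not our script: the function names are hypothetical, the commands are passed as a list rather than read from a file, and a trivial Python invocation stands in for a real compile command so that the example runs anywhere. It relies on `os.times()`, which on UNIX systems reports the user and system CPU time accumulated by child processes that have been waited for.

```python
import os
import subprocess
import sys


def cpu_seconds(command):
    """Run one command and return the user + system CPU time it used."""
    before = os.times()
    subprocess.run(command, check=True)   # waits for the child to finish
    after = os.times()
    user = after.children_user - before.children_user
    system = after.children_system - before.children_system
    return user + system


def time_commands(commands, repeats=3):
    """Time each command `repeats` times; key results by command line."""
    return {
        " ".join(cmd): [cpu_seconds(cmd) for _ in range(repeats)]
        for cmd in commands
    }


# Stand-in for a list of compile commands (one per mode and source file).
commands = [[sys.executable, "-c", "pass"]]
timings = time_commands(commands, repeats=2)

for line, values in timings.items():
    print(line, values)
```

The resulting lists of timings are what a postprocessing step would average, check for excessive deviation, and compare against the reference results.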

These summarized results indicate if the test version 
has suffered performance regressions, the magnitude 
of these regressions, and which benchmark source is 
exhibiting a regression. Analysis of the problem can 
then begin. 

The tools we use for compile-speed and run-time 
analysis are considerably more sophisticated than the 
measurement tools. They are generally provided by 
the CPU design or operating system tools develop- 
ment groups and are widely used for application tun- 
ing as well as compiler improvements. We have used 
the following compile-speed analysis tools: 

■ The compiler's internal -show statistics feature 
gives a crude measure of the time required for each 
compiler phase. 



■ The gprof and hiprof tools are supplied in the 
development suites for DIGITAL UNIX. Both 
operate by building an instrumented version of the 
test software (the compiler itself in our case). The 
gprof tool works with the compiler, the linker, and 
the loader; it is available from several UNIX ven- 
dors. Hiprof is an Atom tool'' " available only on 
DIGITAL UNIX; it does not require compiler or 
linker support. 

The benchmark exhibiting the performance prob- 
lem can then be compiled with the profiling version 
of the compiler, and the compilation profile can be 
captured. Using the display facilities of the tool, we 
can analyze the relevant portions of the execution 
profile. We can then compare this profile with that 
of the reference version to localize the problem to a 
specific area of compiler source. Once this informa- 
tion is available, a specific edit can be identified as 
the cause and a solution can be identified and 
implemented. Another round of measurement is 
needed to verify the repair is effective, similar to the 
procedure for addressing a functional regression. 

■ When the problem needs to be pinpointed more 
accurately than is possible with these profiling 
tools, we use the IPROBE tool, which can provide 
instruction-by-instruction details about the execu-
tion of a function. 14

We have used the following tools or processes for 
run-time analysis: 

■ We apply hiprof and gprof in combination, and 
the IPROBE tool as described above, to the 
run-time behavior of the test program rather than 
to its compilation. 

■ We analyze the NULLSTONE results by examining 
the detailed log file. This log identifies the problem
and the machine code generated. This analysis is usu-
ally adequate since the tests are generally quite simple.

■ If more detailed analysis is needed, e.g., to pin- 
point cache misses, we use the highly detailed 
results generated by the Digital Continuous 
Profiling Infrastructure (DCPI) tool.'"" DCPI can 
display detailed (average) hardware behavior on an 
instruction-by-instruction basis. Any scheduling 
problems that may be responsible for frequent 
cache misses can be identified from the DCPI out- 
put, whereas they may not always be obvious from 
casually observing the machine code. 

■ Finally, we use the estimated schedule dump and 
statistical data optionally generated by the GEM 
back end. 1 This dump tells us how instructions are 
scheduled and issued based on the processor archi- 
tecture selected. It may also provide information 
about ways to improve the schedule. 

In the rest of this section, we discuss three examples
of applying analysis tools to problems identified by the 
performance measurement scripts. 

Compile-Time Test Case 

A compile-time regression occurred after a new opti-
mization called base components was added to the
GEM back end to improve the run-time performance 
of structure references. Table 1 gives compile-time test 
results that compare the ratios of compile times using 
the new optimized back end to those obtained with 
the older back end. The results for the iostream test 
indicate a significant degradation of 25 percent in the 
compile speed for optimize mode, whereas the perfor- 
mance in the other two modes is unchanged. 
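The comparison against reference results reduces to a simple filter on the ratios. The Python sketch below applies it to a few of the optimize-mode ratios from Table 1; the function name and the 5 percent threshold are assumptions made for the example and are not part of our scripts.

```python
def regressions(ratios, threshold=1.05):
    """Return the benchmarks whose new/old time ratio exceeds threshold.

    A ratio above 1.0 means the new compiler is slower; the threshold
    (hypothetical here) leaves a margin for run-to-run noise.
    """
    return {name: r for name, r in ratios.items() if r > threshold}


# Optimize-mode ratios for selected benchmarks, taken from Table 1.
optimize_mode = {
    "iostream": 1.250,
    "t202": 1.130,
    "t300": 1.040,
    "collevol": 0.740,
    "test_vector": 1.120,
}

flagged = regressions(optimize_mode)
for name, ratio in sorted(flagged.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: ratio {ratio:.3f}")
```

Sorting the flagged entries by ratio puts the iostream test first, which is the benchmark analyzed in the remainder of this case.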

To analyze this problem, we built hiprof versions of
the two compilers and compiled the iostream bench-
mark to obtain its compilation profile. Figures 1a and
1b show the top contributions in the flat hiprof pro-
files from the two compilers. These profiles indicate 
that the number of calls made to cse and gem_il_peep 
in the new version is greater than that of the old one 
and that these calls are responsible for performance 
degradation. Figures 2a and 2b show the call graph 
profiles for cse for the two compilers and show the calls 
made by cse and the contributions of each component
called by cse. Since these components are included in
the GEM back end, the problem was fixed there. 

Run-Time Test Cases 

For the run-time analysis, we used two different test 
environments, the Haney kernels benchmark and the 
NULLSTONE test run against gcc.

Haney Kernels The Haney kernels benchmark is a
synthetic test written to examine the performance of 
specific C++ language features. In this run-time test 
case, an older C++ compiler (version 5.5) was com- 
pared with a new compiler under development (version 
6.0). The Haney kernels results showed that the ver- 
sion 6.0 development compiler experienced an overall 
performance regression of 40 percent. We isolated the 
problem to the real matrix multiplication function. 
Figure 3 shows the execution profile for this function. 

We then used the DCPI tool to analyze perfor- 
mance of the inner loop instructions exercised on ver- 
sion 6.0 and version 5.5 of the C++ compiler. The 
resulting counts in Figures 4a and 4b show that the
version 6.0 development compiler suffered a code 
scheduling regression. The leftmost column shows the 
average cycle counts for each instruction executed. 
The reason for this regression proved to be that a test 



Table 1

Ratios of CPU (User and System) Compile Times (Seconds) of the New Compiler to Those of the Old Compiler

File Name               Debug Mode    Default Mode    Optimize Mode
Options                 -O0 -g                        -O4 -g0

a1amch2                 0.970         0.970           0.930
collevol                0.910         0.780           0.740
d_inh                   0.970         0.960           0.960
e_rvirt_yes             0.970         0.980           0.960
interfaceparticle       0.880         0.790           0.730
iostream                0.990         0.980           1.250
pistream                0.890         0.760           0.790
t202                    0.970         0.970           1.130
t300                    0.980         0.960           1.040
t601                    1.010         1.020           1.010
t606                    1.000         1.020           1.020
t643                    1.020         1.010           1.000
test_complex_excepti    0.960         0.890           0.830
test_complex_math       0.970         0.950           0.950
test_demo               0.950         0.830           0.780
test_generic            1.000         1.020           1.100
test_task_queue6        0.970         0.920           0.960
test_task_rand1         0.950         0.890           0.890
test_vector             0.970         0.920           1.120
vectorf                 0.890         0.790           0.850

Averages                0.961         0.920           0.952



granularity: cycles; units: seconds; total: 48.96 seconds

  %    cumulative    self                 self      total
 time   seconds     seconds     calls    ms/call   ms/call   name
  2.8     1.37       1.37       10195     0.13      0.13     cse [12]
  2.8     2.66       1.29      219507     0.01      0.01     gem_il_peep [31]
  2.6     3.93       1.27      515566     0.00      0.00     gem_fi_ud_access_resource
  2.4     5.09       1.17      481891     0.00      0.00     gem_vm_get_nz [37]
  2.3     6.23       1.14      713176     0.00      0.00     _OtsZero [75]



(a) Hiprof Profile Showing Instructions Executed with the New Compiler 



granularity: cycles; units: seconds; total: 27.49 seconds

  %    cumulative    self                 self      total
 time   seconds     seconds     calls    ms/call   ms/call   name
  3.0     0.83       0.83      143483     0.01      0.01     gem_il_peep [40]
  2.7     1.58       0.75      614350     0.00      0.00     _OtsZero [64]
  2.5     2.26       0.68        8664     0.08      0.08     cse [16]
  1.7     2.71       0.45      465634     0.00      0.00     gem_fi_ud_access_resource
  1.8     3.14       0.43      423144     0.00      0.00     gem_vm_get_nz [36]



(b) Hiprof Profile Showing Instructions Executed with the Old Compiler 



Figure 1 

Hiprof Profiles of Compilers 

for pointer disambiguation outside the loop code was 
not performed properly in the version 6.0 compiler. 
The test would have ensured that the pointers a and t 
were not overlapping. 
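The kind of disambiguation test described above can be sketched in source form. This is an illustration only, assuming a simple scaled-add kernel rather than the Haney loop itself; the function names are hypothetical, and the sketch stands in for a check that the compiler generates internally.

```c
#include <stddef.h>
#include <stdint.h>

/* Returns nonzero if the byte ranges [p, p+n) and [q, q+m) overlap. */
int ranges_overlap(const void *p, size_t n, const void *q, size_t m)
{
    uintptr_t a = (uintptr_t)p;
    uintptr_t b = (uintptr_t)q;
    return a < b + m && b < a + n;
}

/* Version for disjoint arrays: restrict tells the compiler that loads
   from a[] may be scheduled ahead of stores to t[]. */
static void add_noalias(double *restrict t, const double *restrict a,
                        double s, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        t[i] += s * a[i];
}

/* Version for possibly overlapping arrays: strict program order. */
static void add_general(double *t, const double *a, double s, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        t[i] += s * a[i];
}

/* The overlap test runs once, outside the loop, and selects a version. */
void scaled_add(double *t, const double *a, double s, size_t count)
{
    if (ranges_overlap(t, count * sizeof *t, a, count * sizeof *a))
        add_general(t, a, s, count);
    else
        add_noalias(t, a, s, count);
}
```

When the test is omitted or fails, only the general version can be used, and every load in the loop must wait for the preceding store.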

We traced the origin of this regression back to the 
intermediate code generated by the two compilers. 
Here we found that the version 6.0 compiler used a 
more modern form of array address computation in 
the intermediate language for which the scheduler had 
not yet been tuned properly. The problem was fixed in 
the scheduler, and the regression was eliminated. 



Initial NULLSTONE Test Run against gcc We measured
the performance of the DEC C compiler in compiling
the NULLSTONE tests and repeated the performance
measurement of the gcc 2.7.2 compiler and libraries
on the same tests. Figures 5a and 5b show the results
of our tests. This comparison is of interest because gcc
is in the public domain and is widely used, being the
primary compiler available on the public-domain
Linux operating system. Figure 5a shows the tests in
which the DEC C compiler performs at least 10 percent
better than gcc. Figure 5b indicates the optimization
tests in which the DEC C compiler shows 10 percent
or more regression compared to gcc.



(a) Hierarchical Profile for cse with the New Compiler

(b) Hierarchical Profile for cse with the Old Compiler



Figure 2

Hierarchical Call Graph Profiles for cse



void rmatMulHC(Real * t,
               const Real * a,
               const Real * b,
               const int M, const int N, const int K)
{
    int i, j, k;
    Real temp;

    memset(t, 0, M * N * sizeof(Real));

    for (j = 1; j <= N; j++)
    {
        for (k = 1; k <= K; k++)
        {
            temp = b[k - 1 + K * (j - 1)];
            if (temp != 0.0)
            {
                for (i = 1; i <= M; i++)
                    t[i - 1 + M * (j - 1)] +=
                        temp * a[i - 1 + K * (k - 1)];
            }
        }
    }
}



Figure 3

Haney Loop for Real Matrix Multiplication


We investigated the individual regressions by looking
at the detailed log of the run and then examining
the machine code generated for those test cases. In this
case, the alias optimization portion showed that the
regressions were caused by the use of an outmoded
standard [15] as the default language dialect (-std0) for
DEC C in the DIGITAL UNIX environment. After we
retested with the ansi_alias option, these regressions
disappeared.

We also investigated and fixed regressions in 
instruction combining and if optimizations. Other 
regressions, which were too difficult to fix within the 
existing schedule for the current release, were added 
to the issues list with appropriate priorities. 

Conclusions 

The measurement and analysis of compiler performance
has become an important and demanding field. The
increasing complexity of CPU architectures and the
addition of new features to languages require the development
and implementation of new strategies for testing
the performance of C and C++ compilers. By
employing enhanced measurement and analysis techniques,
tools, and benchmarks, we were able to address
these challenges. Our systematic framework for compiler
performance measurement, analysis, and prioritization
of improvement opportunities should serve as an
excellent starting point for the practitioner in a situation
in which similar requirements are imposed.
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DCPI Profiles of the Inner Loop 
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Figure 5a 

NULLSTONE Results Comparing gcc with DEC C Compiler, Showing All Improvements of Magnitude 10% or More
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Figure 5b 

NULLSTONE Results Comparing gcc with DEC C Compiler, Showing All Regressions of 10% or Worse
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August G. Reinig 



Alias Analysis in the 
DEC C and DIGITAL C++ 
Compilers 



During alias analysis, the DEC C and DIGITAL C++ 
compilers use source-level type information to 
improve the quality of code generated. Without 
the use of type information, the compilers 
would have to assume that any assignment 
through a pointer expression could modify any 
pointer-aliased object. In contrast, through the 
use of type information, the compilers can 
assume that such an assignment can modify 
only those objects whose type matches that 
referenced by the pointer. 



When two or more address expressions reference the
same memory location, these address expressions are 
aliases for each other. A compiler performs alias analy- 
sis to detect which address expressions do not refer- 
ence the same memory locations. Good alias analysis is 
essential to the generation of efficient code. Code 
motion out of loops, common subexpression elimina- 
tion, allocation of variables to registers, and detection 
of uninitialized variables all depend upon the compiler 
knowing which objects a load or a store operation 
could reference. 

Address expressions may be symbol expressions 
or pointer expressions. In the C and C++ languages, 
a compiler always knows what object a symbol expres- 
sion references. The same is not true with pointer 
expressions. Determining which objects a pointer 
expression may reference is an ongoing topic of 
research.

Most of the research in this area focuses on the use 
of techniques that track which object a pointer expression
might point to [1, 2]. When these techniques cannot
make this determination, they assume that the pointer 
expression points to any object whose address has 
been taken. These techniques generally ignore the 
type information available to the source program. The 
best techniques perform interprocedural analysis to 
improve their accuracy. Although effective, the cost of 
analyzing a complete program can make this analysis 
impractical. 

In contrast, the DEC C and DIGITAL C++ compil- 
ers use high-level type information as they perform 
alias analysis on a routine-by-routine basis. Limiting alias
analysis to within a routine reduces its cost, albeit at 
the cost of reducing its effectiveness. 

The use of this type information results in slight 
improvements in the performance of some standard-conforming
C and C++ programs. These improvements
come at little expense in terms of compilation
time. There is, however, a risk that the use of this type 
information on nonstandard-conforming C or C++ 
programs may result in the compiler producing code 
that exhibits unexpected behavior. 
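As an illustration of the risk, consider code that inspects the bits of a float. Reading the float through a cast such as *(uint32_t *)&f accesses an object through an incompatible type, which is exactly the kind of access a type-based alias analysis assumes cannot happen. A conforming alternative copies the bytes instead. The sketch below is not taken from the compilers discussed here; the function name is hypothetical, and it assumes a 32-bit float.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns the object representation of a float as a 32-bit integer.
   memcpy accesses the bytes as characters, which the Standard C
   aliasing rules always permit, so no incompatible-type access occurs. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* assumption: 32-bit float */
    memcpy(&bits, &f, sizeof bits);
    return bits;
}
```

On a platform that uses the IEEE 754 single-precision format, float_bits(1.0f) is 0x3F800000.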
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The C and C++ Type Systems 



Research available on the use of type information during
alias analysis involves languages other than C and
C++ [5]. Traditionally, C is a weakly typed language. A
pointer that references one type may actually point to
an object of a different type. For this reason, most
alias-analysis techniques ignore type information when
analyzing programs written in C.

The ISO Standard for C defines a much stronger
typing system. In ISO Standard C, a pointer expression
can access an object only if the type referenced by
the pointer meets one of the following criteria:

■ It is compatible with the type of the object, ignoring
type qualifiers and signedness.

■ It is compatible with the type of a member of an
aggregate or union or submembers thereof, ignoring
type qualifiers and signedness.

■ It is the char type.

Thus, in Figure 1, the pointer p can point to A, B,
C, or S (through S.sub.m) but not to T or F. The
pointer q, being a pointer to char, can refer to any of
A, B, C, S, T, or F.

The proposed ISO Standard for C++ defines a similar
typing system for C++. The strength of the
Standard C and C++ type systems allows the DEC C
and DIGITAL C++ compilers to use type information
during alias analysis.

Many existing C applications do not conform to the
Standard C typing rules. They use cast expressions to
circumvent the Standard C type system. To support
these applications, the DEC C compiler has a mode
whereby it ignores type information during alias analysis.
The DIGITAL C++ compiler also has such a mode.
This mode exists to support those C++ programmers
who circumvent the C++ type system.



int A;
signed int const B;
unsigned int volatile C;
struct {
    struct {
        int m;
    } sub;
} S;

struct {
    short z;
} T;

float F;
int *p;
char *q;



Figure 1

Code Fragment Associated with the Explanation of the
Standard C Aliasing Rules



The Side-effects Package

The DEC C and DIGITAL C++ compilers are GEM 
compilers. The GEM compiler system includes a
highly optimizing back end. This back end uses the 
GEM data access model to determine which objects a 
load or a store may access. GEM compiler front ends 
augment the GEM data access model with a side- 
effects package, i.e., an alias-analysis package. The 
side-effects package provides the GEM optimizer 
additional information about loads and stores using 
language-specific information otherwise unavailable 
to the GEM optimizer. 

The DEC C and DIGITAL C++ compilers share a 
common side-effects package. The DEC C and C++ 
side-effects package 

■ Determines which symbols, types, and parts thereof 
a routine references 

■ Determines the possible side effects of these references 

■ Answers queries from the GEM optimizer regarding 
the effects and dependencies of memory accesses 

Preserving Memory Reference Information 

The DEC C and DIGITAL C++ front ends perform 
lexical analysis and parsing of the source program, 
generating a GEM intermediate language (GEM IL) 
graph representation of the source program. A tuple
is a node in the GEM IL and represents an operation in
the source program. 

As the DEC C and DIGITAL C++ front ends gener- 
ate GEM IL, they annotate each fetch (read) and store 
(write) tuple with information describing the object 
being read or written. The front ends annotate fetches 
and stores of symbols with information about the sym- 
bol. They annotate fetches and stores through pointers 
with information about the type the pointer references. 
The annotation information includes information 
describing exactly which bytes of the symbol or type 
the tuple accesses. This allows the side-effects package 
to differentiate between access to two different mem- 
bers of a structure. 
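The byte-range annotation can be pictured with a small sketch. The record layout and names below are assumptions made for illustration and are not the GEM IL representation; fixed-width members are used so that the offsets are predictable.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Two 32-bit members, so x occupies bytes 0-3 and y bytes 4-7. */
struct S { int32_t x; int32_t y; };

/* Hypothetical annotation: which bytes of which symbol or type
   a fetch or store tuple accesses (end is inclusive). */
typedef struct {
    const char *base;   /* symbol name or type name */
    size_t start;
    size_t end;
} MemRef;

MemRef make_ref(const char *base, size_t offset, size_t size)
{
    MemRef r;
    r.base = base;
    r.start = offset;
    r.end = offset + size - 1;
    return r;
}

/* Two annotations can touch the same bytes only if they name the
   same base and their byte ranges intersect. */
int refs_overlap(MemRef a, MemRef b)
{
    return strcmp(a.base, b.base) == 0 &&
           a.start <= b.end && b.start <= a.end;
}
```

For a symbol v1 of type struct S, make_ref("v1", offsetof(struct S, y), sizeof(int32_t)) describes bytes 4 through 7, which do not overlap the bytes of v1.x.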

Arrays Neither the DEC C nor the DIGITAL C++ 
front end differentiates between accesses to different 
elements of an array. Both assume that all array accesses 
are to the first element of the array. The GEM optimizer 
does extensive analysis of array references [7]. Being flow
insensitive, the DEC C and C++ side-effects package 
can, at best, differentiate between two array references 
that both use constant indices. The GEM optimizer can 
do much more. 

What the GEM optimizer cannot do, however, is 
determine that an assignment through a pointer to an 
int does not change any value in an array of doubles. 
This is the purpose of the DEC C and C++ side-effects 
package. Mapping all array accesses to access the first 
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element of an array does not hinder this purpose and 
simplifies alias analysis of arrays. 

Tuple Annotation Example For the program fragment
in Figure 2, the DEC C and DIGITAL C++ front ends
generate the annotated tuples displayed in Table 1.

Intraprocedural Effects Analysis 

The GEM optimizer makes several optimization passes 
over a routine. During each optimization pass, the
DEC C and C++ side-effects package provides alias 
analysis information to the GEM optimizer by means 
of the following procedures: 

■ Examining each tuple within the routine that refer- 
ences (reads or writes) memory, allocating effects 
classes that represent the memory that the tuple 
references 

■ Performing type-based alias analysis 

■ Responding to alias-analysis queries from the GEM 
optimizer 

To determine the possible side effects of a memory 
access, the side-effects package partitions memory into 
effects classes. An effects class represents all or part of 



struct S {
    int x;
    int y;
} v1, v2;
int i;
double d[3];
struct S *p;

p->x = 3;
v1.y = 3;
v2 = v1;
d[i] = d[0];



Figure 2

Code Fragment Associated with Tuple Annotation
Example 



Table 1 

Tuple Annotations 



an object. To minimize the number of effects classes 
under consideration, the side-effects package creates 
effects classes for only those object regions referenced 
within the current routine. 

Having created effects classes for each referenced 
object region within the current routine, the side- 
effects package then associates a signature with each 
effects class. The signature for an effects class records 
the possible side effects of referencing the effects class. 
The side-effects package uses this signature to respond 
to queries from the GEM optimizer about the effects 
and dependencies of tuples and symbols within the 
current routine. 

Allocating Effects Classes There are two kinds of 
effects classes. The first kind represents a region of an 
individual object. The second kind represents a region 
of all allocated objects of a particular type. Allocated 
objects are those created by the malloc() function
and its relatives or the C++ new operator. 

As it processes the tuples within a routine, the side- 
effects package examines the memory reference infor- 
mation associated with the tuple. The side-effects 
package creates an effects class for each different set of 
memory reference information it encounters. Two sets 
of memory reference information are different if they 
contain different start- or end-offset information or 
different symbol information. 

Two sets of memory reference information that 
contain different type information are different only if 
the two types are not effects equivalent. Two types are
effects equivalent if they differ only in their signedness 
or their type qualifiers. The signed int type and the 
volatile unsigned int type are effects equivalent. An 
assignment through a pointer to a signed int may 
change the value of a volatile unsigned int. 
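The comparison rules above can be sketched as follows. The type model and all names are illustrative assumptions; the real side-effects package works on the front end's own type representation.

```c
#include <stddef.h>
#include <string.h>

typedef enum { K_CHAR, K_SHORT, K_INT, K_LONG, K_FLOAT, K_DOUBLE } Kind;
enum { Q_CONST = 1, Q_VOLATILE = 2 };

/* Hypothetical scalar type description. */
typedef struct {
    Kind kind;
    int is_unsigned;
    unsigned qualifiers;
} ScalarType;

/* Hypothetical memory reference information for one tuple. */
typedef struct {
    const char *symbol;   /* NULL for a reference through a pointer */
    ScalarType type;      /* used only when symbol is NULL */
    unsigned start, end;  /* byte offsets, end inclusive */
} RefInfo;

/* Types are effects equivalent if they differ only in signedness
   or type qualifiers. */
int effects_equivalent(ScalarType a, ScalarType b)
{
    return a.kind == b.kind;
}

/* Nonzero if two sets of reference information share an effects class. */
int same_effects_class(const RefInfo *a, const RefInfo *b)
{
    if (a->start != b->start || a->end != b->end)
        return 0;
    if (a->symbol != NULL || b->symbol != NULL)
        return a->symbol != NULL && b->symbol != NULL &&
               strcmp(a->symbol, b->symbol) == 0;
    return effects_equivalent(a->type, b->type);
}
```

Under these rules a store through a pointer to signed int and a load through a pointer to volatile unsigned int fall into one effects class, while a reference through a pointer to float gets its own.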

Typically, an effects class represents a complete 
object or an individual member of a structure. An 
effects class may represent a subregion of the region 
represented by another effects class. This occurs when- 
ever code references a whole structure as well as indi- 
vidual members of the structure. In the case of unions, 
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struct S * 
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if two members occupy exactly the same memory loca- 
tions, a single effects class represents both members. 

For the program fragment in Figure 3, the side- 
effects package creates the effects classes displayed in 
Table 2. 

There is only one effects class for *uip and *ip since 
uip and ip may point to the same object. There are no 
effects classes for bytes 0 through 3 of s and struct S as
there are no references to s.x or sp->x. By allocating 
effects classes for only those object regions referenced 
within the routine, the side-effects package greatly 
reduces both the number of effects classes and the 
time required to perform alias analysis. 

In the traditional C type system, a pointer expres- 
sion may point to anything, regardless of type. To rep- 
resent this, the side-effects package creates exactly one 
effects class to represent allocated objects. It ignores 
the type and the start- and end-offset information. 



struct S {
    int x;
    struct T {
        int y;
        float z;
    } t;
} s;

struct S *sp;
signed int *ip;
unsigned int *uip;
float *fp;

*uip = *ip;
*fp = 2;
sp->t = s.t;
sp->t.y = 2;
s = *sp;



Figure 3 

Code Fragment Associated with Allocating Effects Classes 



Table 2 

Effects Classes Using the Standard C Type Rules



Using the traditional C type system, for the program 
fragment shown in Figure 3, the side-effects package 
creates the effects classes displayed in Table 3. Here, 
effects class 7 replaces effects classes 7 through 11 in
Table 2. All the differentiation by types disappears. 

Effects-class Signatures Having created the effects 
classes, the side-effects package associates a signature 
with each effects class. In addition, it associates an 
effects-class signature with each tuple within the rou- 
tine and each symbol referenced within the routine. 

An effects-class signature records the possible side
effects of referencing an effects class. A reference to 
one effects class may reference another effects class. 
The effects class for a load through a pointer to an int 
indicates that the load references an allocated int 
object. The pointer to an int may actually reference a 
pointer-aliased int symbol or an int member of a structure
or union.

An effects-class signature is a subset of all the effects 
classes that might be referenced by a tuple. There is 
only one requirement for an effects-class signature: If 
two tuples may refer to the same part of memory, the 
intersection of their respective effects-class signatures 
must be non-null. If two tuples cannot refer to the 
same part of memory, it is desirable that the intersection
of their effects-class signatures is null. An empty
intersection leads to more optimization opportunities. 

The most obvious rule for building an effects-class 
signature is to include in it all the effects classes that 
might be touched by a reference to the effects class. 
This leads to suboptimal code in cases such as that 
shown in Figure 4. 

There are three effects classes for this code, s<0,3>, 
s<4,7>, and s<0,7>, generated by references to s.x, s.y,
and s, respectively. If the effects-class signature for 
s<0,3> includes both s<0,3> and s<0,7> and the 
effects-class signature for s<4,7> includes both s<4,7> 
and s<0,7>, then the intersection of these two effects- 



Effects Class   Type or Symbol   Start Offset   End Offset   Source Generating Effects Class

1               s                0              11           s
2               s                4              11           s.t
3               sp               0              7            sp
4               fp               0              7            fp
5               ip               0              7            ip
6               uip              0              7            uip
7               struct S         0              11           *sp
8               struct S         4              11           sp->t
9               struct S         4              7            sp->t.y
10              float            0              3            *fp
11              int              0              3            *uip and *ip
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Table 3 

Effects Classes Using the Traditional C Type Rules

Effects Class   Type or Symbol   Start Offset   End Offset   Source Generating Effects Class

1               s                0              11           s
2               s                4              11           s.t
3               sp               0              7            sp
4               fp               0              7            fp
5               ip               0              7            ip
6               uip              0              7            uip
7               char             0              1            *sp, sp->t, *uip, sp->t.y, *fp, *ip



class signatures is non-null. This falsely indicates that 
s.x and s.y may refer to the same memory location. This 
forces GEM to generate code that stores s.y after stor- 
ing to s.x. 

The DEC C and C++ side-effects package uses more 
effective rules for building effects-class signatures. These 
rules offer more optimization opportunities while pre- 
serving necessary dependency information. 

Effects-class Signatures for Symbols If an effects class 
represents a region A of a symbol, its signature includes 
itself. Its signature also includes all effects classes repre- 
senting regions of the symbol wholly contained within 
A. Finally, it includes any effects class representing a 
region of the symbol that partially overlaps A. It does 
not include effects classes representing regions of the 
symbol that do not overlap A or that wholly contain A. 

Table 4 gives the symbol effects-class signatures for 
the three effects classes under discussion. 

The inclusion of subregions in an effects-class signature
means that references to symbols interfere with
references to members therein and vice versa. Excluding 
super-regions in an effects-class signature means that 



struct S {
    int x;
    int y;
} s;

s.x =

s.y =

return s;



Figure 4 

Example of Problematic Code for the Naive Rule for 
Building Effects-class Signatures 



Table 4 

Symbol Effects-class Signatures 

Effects Class Effects-class Signature 

s<0,3> s<0,3> 
s<4,7> s<4,7> 
s<0,7> s<0,3>, s<4,7>, s<0,7>



references to two separate members of a symbol do 
not interfere with each other. In Table 4, the effects- 
class signatures for s<0,3> and s<4,7> do not interfere
with each other. Both signatures interfere with the 
effects-class signature for s<0,7>. 

The inclusion of effects classes representing partially 
overlapping regions of a symbol allows for the correct 
representation of the side effects of referencing sub- 
members of complex unions. 
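The symbol rule can be written down compactly. The following sketch represents each effects class of one symbol as an inclusive byte range and each signature as a bit mask with one bit per class; these representations and names are assumptions chosen for brevity.

```c
/* Inclusive byte range of a symbol covered by one effects class. */
typedef struct { unsigned start, end; } Region;

static int region_contains(Region outer, Region inner)
{
    return outer.start <= inner.start && inner.end <= outer.end;
}

static int region_overlaps(Region a, Region b)
{
    return a.start <= b.end && b.start <= a.end;
}

/* Does the signature of the class covering region a include the
   class covering region r (both regions of the same symbol)? */
int in_symbol_signature(Region a, Region r)
{
    if (!region_overlaps(a, r))
        return 0;   /* disjoint regions are excluded */
    if (region_contains(r, a) && !region_contains(a, r))
        return 0;   /* regions that wholly contain a are excluded */
    return 1;       /* a itself, subregions, and partial overlaps */
}

/* Signature of class i as a bit mask over the n classes (n <= 32). */
unsigned symbol_signature(const Region *classes, int n, int i)
{
    unsigned sig = 0;
    int j;
    for (j = 0; j < n; j++)
        if (in_symbol_signature(classes[i], classes[j]))
            sig |= 1u << j;
    return sig;
}
```

Applied to the three classes s<0,3>, s<4,7>, and s<0,7>, the function reproduces Table 4: the first two signatures have an empty intersection, and each intersects the signature of s<0,7>.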

Effects-class Signatures for Types If an effects class 
represents a region of a type, the contents of its signa- 
ture depends upon the type. If the type is the char type, 
the effects-class signature contains all the effects classes 
representing regions of other types or pointer-aliased
symbols. This reflects the C and C++ type rules, which 
state that a pointer to a char can point to anything. 

If the type is some type T other than char, the effects-class
signature contains effects classes representing:

■ Those regions of T that overlap the region of T the
effects class represents, using the same overlap rules
as for symbols

■ Any region of a pointer-aliased symbol whose type
is compatible to T, ignoring type qualifiers and
signedness

■ A region of a pointer-aliased aggregate or union
symbol that contains a member or submember
whose type is compatible to T, ignoring type qualifiers
and signedness

■ A region of an aggregate or union type that contains
a member or submember whose type is compatible
to T, ignoring type qualifiers and signedness

Table 5 gives the signatures for the effects classes in 
Table 2, assuming that the symbol s is pointer aliased. 

Including the effects classes of symbols in the effects-class
signatures of types records the interference of
references through pointers with references to pointer-aliased
symbols. In Figure 3, the pointer uip points to
an unsigned int. The member s.t.y has type int. Thus,
uip may point to s.t.y. The member s.t contains s.t.y.
Thus, the signature for the effects-class int<0,3> contains
the effects-class s<4,11>. This means that the
load of s.t depends upon the store through uip.



Table 5

Type Effects-class Signatures

Number   Effects Class       Effects-class Signature

1        s<0,11>             1, 2
2        s<4,11>             2
3        sp<0,7>             3
4        fp<0,7>             4
5        ip<0,7>             5
6        uip<0,7>            6
7        struct S<0,11>      1, 2, 7, 8, 9
8        struct S<4,11>      1, 2, 8, 9
9        struct S<4,7>       1, 2, 9
10       float<0,3>          1, 2, 7, 8, 10
11       int<0,3>            1, 2, 7, 8, 9, 11


Including the effects classes of types in the signa- 
tures of the effects classes of other types records the 
interference of references through a pointer with ref- 
erences through pointers to other types. In Figure 3, 
the pointer fp points to a float object. The member 
sp->t.z has type float. Thus, fp may point to sp->t.z. 
The member sp->t contains sp->t.z. Thus, the signature
for the effects-class float<0,3> contains the effects-
class struct S<4,11>. This reflects the fact that the 
store to sp->t.y depends upon the store through fp, 
i.e., it must occur after the store through fp. 

Even though the signature for the effects-class 
float<0,3> contains the effects-class struct S<4,11> 
(sp->t), it does not contain the effects-class struct 
S<4,7> (sp->t.y). There is no float member of struct 
S whose position within struct S overlaps bytes 4 
through 7 of struct S. There is a float member of struct 
S, namely z, whose position within struct S overlaps 
bytes 4 through 11 of struct S. The signature for the 
effects-class float<0,3> would not contain the effects- 
class s<0,3> if it existed. There is no float member of s 
whose position overlaps bytes 0 through 3 of s. 
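The type-compatibility part of these rules can be sketched with a toy type model. The sketch below ignores byte regions entirely and answers only the coarser question of whether a pointer to one type may reference some part of an object of another type. Signedness and qualifiers are left out of the model because the rules ignore them. All names are hypothetical.

```c
typedef enum { T_CHAR, T_SHORT, T_INT, T_FLOAT, T_STRUCT } TypeKind;

/* Hypothetical type description; members are used for T_STRUCT only. */
typedef struct Type {
    TypeKind kind;
    const struct Type *members[4];
    int member_count;
} Type;

/* May a pointer to `pointee` reference (part of) an object of
   type `object` under the Standard C rules? */
int may_reference(const Type *pointee, const Type *object)
{
    int i;
    if (pointee->kind == T_CHAR)
        return 1;                      /* char can point to anything */
    if (pointee == object)
        return 1;                      /* same type */
    if (pointee->kind != T_STRUCT && pointee->kind == object->kind)
        return 1;                      /* compatible scalar types */
    if (object->kind == T_STRUCT)
        for (i = 0; i < object->member_count; i++)
            if (may_reference(pointee, object->members[i]))
                return 1;              /* member or submember */
    return 0;
}

/* The types of Figure 3, for illustration. */
const Type CHAR_T  = { T_CHAR,  { 0 }, 0 };
const Type SHORT_T = { T_SHORT, { 0 }, 0 };
const Type INT_T   = { T_INT,   { 0 }, 0 };
const Type FLOAT_T = { T_FLOAT, { 0 }, 0 };
const Type STRUCT_T = { T_STRUCT, { &INT_T, &FLOAT_T }, 2 };
const Type STRUCT_S = { T_STRUCT, { &INT_T, &STRUCT_T }, 2 };
```

Here may_reference(&FLOAT_T, &STRUCT_S) holds because struct S contains the float submember t.z, whereas a pointer to short cannot reference any part of struct S.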

Additional Effects-class Signatures The side-effects 
package creates a special effects-class signature repre- 
senting the side effects of a call. A called procedure 
may reference the following: 

■ Any pointer-aliased symbol (by means of a refer- 
ence through a pointer) 

■ Any allocated object (by means of a reference 
through a pointer) 

■ Any nonlocal symbol (by means of direct access) 

■ Any local static symbol (by means of recursion) 

The effects signature for a call includes all the effects 
classes representing these objects. 



Responding to Optimizer Queries During optimiza- 
tion, the optimizer makes two types of queries to the 
side-effects analysis routines: dominator-based queries 
and nondominator-based queries. 

When doing nondominator-based optimizations, the 
optimizer uses a bit vector to represent those objects a 
write may change (its effects). A similar bit vector repre- 
sents those objects whose value a read may fetch (its 
dependencies). Each bit in the bit vector represents an 
effects class. If a tuple's effects-class signature contains 
an effects class, that effects class's bit is set in the tuple's 
bit vector. The optimizer uses the union of the bit vectors
associated with a set of tuples to represent the combined
effects or dependencies of those tuples.

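The bit-vector representation can be sketched as follows, assuming at most 64 effects classes so that one machine word suffices. The type and function names are illustrative and do not come from the GEM optimizer.

```c
#include <stdint.h>

/* One bit per effects class; this sketch supports up to 64 classes. */
typedef uint64_t EffectsBits;

/* Build a tuple's bit vector from the class numbers in its signature. */
EffectsBits bits_from_signature(const int *classes, int count)
{
    EffectsBits bits = 0;
    int i;
    for (i = 0; i < count; i++)
        bits |= (EffectsBits)1 << classes[i];
    return bits;
}

/* Combined effects (or dependencies) of a set of tuples: the union. */
EffectsBits combine(const EffectsBits *vectors, int count)
{
    EffectsBits all = 0;
    int i;
    for (i = 0; i < count; i++)
        all |= vectors[i];
    return all;
}

/* A write may affect a read if their bit vectors intersect. */
int may_interfere(EffectsBits effects, EffectsBits dependencies)
{
    return (effects & dependencies) != 0;
}
```

With the signatures of Table 4 numbered 0, 1, and 2, a store to s.x and a load of s.y do not interfere, while either one interferes with a load of the whole of s.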
Dominator-based queries involve finding the near- 
est dominating tuple that might write to the same 
memory location as the tuple in question. Tuple A 
dominates tuple B if every path from the start of the 
routine to B goes through A [8]. If both tuples A and C
dominate B, tuple A is the nearer dominator if C dom- 
inates A. 

When doing dominator-based optimizations, the 
side-effects package represents the tuples in the cur- 
rent dominator chain as a stack, adding and removing 
tuples from the stack as GEM moves from one path 
in the routine's dominator tree to another. Searching 
a single stack for the nearest dominating tuple that 
might write the same memory as the tuple in question 
references could lead to O(N^2) performance, where N
is the number of tuples in the dominator chain. This 
worst-case behavior occurs when none of the tuples in 
a dominator chain affects any subsequent tuple in the 
chain. Each time the side-effects package searches the 
stack, it examines all the tuples in the stack. 

To avoid this, the DEC C and C++ side-effects package
creates a stack for each effects class. When pushing
a tuple, the side-effects package pushes the tuple on
each stack associated with an effects class in the tuple's
effects-class signature. When the GEM optimizer tells 
the side-effects package to find the nearest dominating 
write for a tuple, the side-effects package need only 
choose the nearest of those tuples that are on the top 
of the stacks associated with the tuple's effects-class 
signature. It need only look at the top of each stack, 
because a tuple would not be in the stack unless it 
might affect objects in the effects class associated with 
the stack. 
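The per-class stacks can be sketched as follows. The sketch assumes that only writing tuples are pushed, that each tuple is identified by its position in the current dominator chain (larger means nearer), and that signatures are bit masks over a small fixed number of classes. These choices, and all names, are assumptions for illustration.

```c
#include <assert.h>

#define MAX_CLASSES 8
#define MAX_DEPTH   64

/* One stack of chain positions per effects class. */
typedef struct {
    int tuples[MAX_CLASSES][MAX_DEPTH];
    int depth[MAX_CLASSES];
} DomStacks;

void dom_init(DomStacks *s)
{
    int c;
    for (c = 0; c < MAX_CLASSES; c++)
        s->depth[c] = 0;
}

/* Push a writing tuple on every stack named in its signature. */
void dom_push(DomStacks *s, int position, unsigned signature)
{
    int c;
    for (c = 0; c < MAX_CLASSES; c++)
        if (signature & (1u << c)) {
            assert(s->depth[c] < MAX_DEPTH);
            s->tuples[c][s->depth[c]++] = position;
        }
}

/* Remove the most recently pushed tuple when leaving its subtree. */
void dom_pop(DomStacks *s, int position, unsigned signature)
{
    int c;
    for (c = 0; c < MAX_CLASSES; c++)
        if (signature & (1u << c)) {
            assert(s->depth[c] > 0);
            assert(s->tuples[c][s->depth[c] - 1] == position);
            s->depth[c]--;
        }
}

/* Nearest dominating write for a tuple with the given signature:
   the largest position found on top of the relevant stacks,
   or -1 if none of those stacks holds a tuple. */
int dom_nearest_write(const DomStacks *s, unsigned signature)
{
    int c, best = -1;
    for (c = 0; c < MAX_CLASSES; c++)
        if ((signature & (1u << c)) && s->depth[c] > 0) {
            int top = s->tuples[c][s->depth[c] - 1];
            if (top > best)
                best = top;
        }
    return best;
}
```

Each query inspects at most one entry per effects class in the querying tuple's signature, regardless of how many tuples are in the dominator chain.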

The multistack worst-case behavior is O(NC). There
are C separate stacks, one for each effects class. The
effects-class signature for each effects class may contain
all the other effects classes. This would mean that
each of the N tuples in the dominator chain would
appear in each of the stacks.

Although the worst-case behavior for the multistack
case is no better than the single-stack case (C may be
equal to N), in practice there are often more tuples
within a routine than effects classes. Furthermore, 
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effects-class signatures often contain a small number 
of effects classes. A small number of effects classes in 
an effects-class signature means that there are a small 
number of stacks to consider. Choosing the nearest 
dominator from among the top tuples on these stacks 
requires examining only a small number of tuples.

Cost of Using Type Information 

When compiling all of the SPECint95 test suite [9] using
high optimization, alias analysis accounts for approximately
5 percent of the compilation time. The use of
Standard C type rules during alias analysis increases
compilation time by less than 0.2 percent (time measured
in number of cycles consumed by the compiler
as reported by Digital Continuous Profiling Infrastructure
[DCPI] [10]). The increase in compilation time
varies from program to program but never exceeds 
0.5 percent. Handling the extra effects classes gener- 
ated by using Standard C type aliasing information 
accounted for most of the increase. 

Potentially, the cost of including type-aliasing infor- 
mation could be huge. Calculating which effects classes 
a reference through a char * pointer could touch is 
straightforward as shown by the algorithm in Figure 5. 

A much more complicated process is required to 
calculate which effects classes could be touched by a 
reference through a pointer to a type other than char. 
The algorithm in Figure 6 performs this process. 

Fortunately, the innermost section of this loop is 
rarely executed. The innermost section executes only 
if a routine references a structure either through a 
pointer or a pointer-aliased symbol, that structure 
contains a substructure, and the routine references the 
substructure through a pointer. 



Effectiveness 

The benchmark programs from the SPECint95 suite 
offer some convenient test cases for measuring the 
effectiveness of type-based alias analysis. The sources are 
readily available and portable. The programs conform 
to alias rules established by the American National 
Standards Institute (ANSI) and are compute intensive. 
Unfortunately, they do not contain floating-point cal- 
culations. This reduces the number of different types 
used in the programs. Type-based alias analysis works
best when there are many different types in use. 

Three of the SPECint95 programs show no improve- 
ment when compiled using the Standard C typing rules 
as opposed to using the traditional C typing rules. 
These programs, namely compress, go, and li, do not 
use many different types and pointers to them. When 
all the pointers in a program are pointers to ints (go), 
there is only one effects class for all pointer accesses. 
Because the compiler has no way to differentiate 
among the objects touched by a dereference of a 
pointer expression, it generates identical code for these 
programs, regardless of the type rules used. The gen- 
erated code for li differs only slightly and only for 
infrequently executed routines. 

Changes in generated code for the remaining five 
benchmarks are more prevalent. Two benchmarks, 
ijpeg and perl, show a small reduction in the number 
of loads executed but no meaningful reduction in the
total number of instructions executed. The other 
three SPECint95 benchmarks show varying degrees 
of reduction in both the number of loads executed 
(see Table 6) and the total number of instructions 
executed (see Table 7). 



foreach pointer aliased symbol
    foreach effects class representing a region of the symbol
        add that effects class to the effects class signature for char

Figure 5

Calculation of the Effects-class Signature of the Type char *



foreach pointer aliased symbol or type referenced through a pointer
    foreach member therein
        if the member's type is referenced through pointer
            foreach effects class representing a region of the member's type
                foreach effects class representing a region of the symbol or type
                        referenced through a pointer
                    if the two effects class regions overlap
                        add the symbol's or pointer's effects class to the effects
                        class signature associated with the effects class
                        representing the member's type

Figure 6

Calculation of the Effects-class Signature for Types Other Than char

Table 6

Number of Loads Executed by the Select SPECint95 Benchmarks

                 Millions of Loads        Millions of Loads
SPEC Benchmark   Using Type Information   without Type Information   Percent Reduction

gcc              10,268                   10,365                     0.9
ijpeg            16,853                   16,888                     0.2
m88ksim          13,889                   14,157                     1.9
perl             11,260                   11,296                     0.3
vortex           18,994                   19,207                     1.1



Table 7

Number of Instructions Executed by the Select SPECint95 Benchmarks

                 Millions of Instructions   Millions of Instructions
SPEC Benchmark   Using Type Information     without Type Information   Percent Reduction

gcc              42,830                     42,935                     0.2
ijpeg            82,844                     82,834                     0.0
m88ksim          72,490                     73,155                     0.9
perl             45,219                     45,252                     0.1
vortex           80,093                     80,607                     0.6



The load and instruction counts are those reported
by using Atom's pixie tool on the SPECint95 binaries
to generate pixstat data. The compiler used was a
development C compiler. All compilations used the
following switches: -fast, -O4, -arch ev56, and
-inline speed. The compilations using the
Standard C type system used the -ansi_alias
switch. The compilations using the traditional C type
system used the -noansi_alias switch. The benchmark
binaries were run using the reference data set.

DCPI measurements of the reduction in the number
of cycles consumed by these SPECint95 benchmarks
showed no consistent reductions. Run-to-run
variability in the data collected swamped any cycle-time
reductions that might have occurred. Similarly,
measurements of gains in SPECint95 results due to
the use of type information during alias analysis showed
no significant changes.

Changes in Generated Code 

The code-generation changes one sees in the SPECint95 
benchmarks are exactly what one would expect. 

The use of type information during alias analysis 
reduces the number of redundant loads. An example 
of this occurs in ijpeg, which contains the code sequence: 

main->rowgroup_ctr
    = (JDIMENSION) (cinfo->min_DCT_scaled_size + 1);
main->rowgroups_avail
    = (JDIMENSION) (cinfo->min_DCT_scaled_size + 2);

in process_data_context. Using the traditional C type
system, the compiler must assume that main->rowgroup_ctr
is an alias for cinfo->min_DCT_scaled_size. Thus, it must
generate code that loads cinfo->min_DCT_scaled_size
twice. The Standard C type system allows the compiler
to generate only one load of cinfo->min_DCT_scaled_size.
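
The effect can be illustrated with a reduced C sketch. The function names and the choice of long and int as the two unrelated types are assumptions made for the illustration; they are not the ijpeg declarations. The second function shows, written out by hand, the code the compiler is entitled to produce under the Standard C rules.

```c
#include <assert.h>
#include <stddef.h>

/* As written in the source: *size is read again after a store through ctr.
   Under traditional C rules the store might have changed *size, so the
   compiler must reload it. */
void set_counters(long *ctr, long *avail, const int *size)
{
    assert(ctr != NULL && avail != NULL && size != NULL);
    *ctr   = *size + 1;
    *avail = *size + 2;
}

/* Hand-written equivalent of what the Standard C rules permit: a store to
   a long cannot modify an int object, so one load of *size suffices. */
void set_counters_one_load(long *ctr, long *avail, const int *size)
{
    int s;
    assert(ctr != NULL && avail != NULL && size != NULL);
    s = *size;            /* the only load of *size */
    *ctr   = s + 1;
    *avail = s + 2;
}
```

The two functions behave identically only when the program honors the aliasing rules, which is the condition the -ansi_alias switch asserts.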

Several of the benchmarks contain code similar to
the following from conversion_recipe in gcc:

curr.next->list->opcode = -1;
curr.next->list->to = from;
curr.next->list->cost = 0;
curr.next->list->prev = 0;

Using traditional C type rules, the compiler must
generate four loads of curr.next->list. The compiler must
assume that the pointer curr.next->list may point to
itself, making curr.next->list->member an alias for
curr.next->list. The Standard C type rules allow the
compiler to assume that curr.next->list does not point
to itself. This allows the compiler to generate code that
reuses the result of the first load of curr.next->list,
eliminating three redundant loads.

In another example in gcc, the use of Standard C 
type rules allows the compiler to move a load outside a 
loop. The following loop occurs in fixup_gotos: 

for (; lists; lists = TREE_CHAIN (lists))
    if (TREE_CHAIN (lists)
        == thisblock->data.block.outer_cleanups)
        TREE_ADDRESSABLE (lists) = 1;

Standard C type rules tell the compiler that the store
generated by TREE_ADDRESSABLE (lists) = 1
cannot modify thisblock->data.block.outer_cleanups.
This allows the compiler to generate code that fetches
thisblock->data.block.outer_cleanups once before
entering the loop. Using traditional C type rules,
the compiler must generate code that fetches
thisblock->data.block.outer_cleanups each time it
traverses the loop.

Not only can type information reduce the number
of redundant loads, it can reduce the number of
redundant stores. In m88ksim, there are many routines
similar to the following:

int ffirst(struct instruction *cmd, union opcode *ptr) {
    ptr->gen.opc1 = 0x3d;
    ptr->gen.dest = operands.value[0];
    ptr->gen.opc2 = cmd->opc.rrr;
    ptr->gen.src2 = operands.value[1];
    return(0);
}

where opc1, dest, opc2, and src2 are bit fields sharing
the same 32 bits (longword). Using traditional C typing
rules, ptr->gen and cmd->opc may be aliases for
each other. Thus to implement the above routine, the
compiler must generate code that performs the
following actions:

■ Load ptr->gen

■ Update bit fields ptr->gen.opc1 and ptr->gen.dest

■ Store ptr->gen

■ Load cmd->opc.rrr

■ Update bit fields ptr->gen.opc2 and ptr->gen.src2

■ Store ptr->gen

Using Standard C typing rules, the compiler does not
have to generate the first store of ptr->gen. The
assignments to ptr->gen.opc1 and ptr->gen.dest cannot
change cmd->opc.rrr. In this case, alias analysis that is
not type based would have a difficult time detecting
that ptr->gen and cmd->opc do not alias each other.
M88ksim never calls ffirst directly. It calls it by means
of an array-indexed function pointer.

A Note of Caution 

Many C programs do not adhere to the Standard C 
aliasing rules. Through the use of explicit casting and
implicit casting, they access objects of one type by means 
of pointers to other types. More aggressive optimization 
by GEM combined with more detailed alias-analysis 
information from the DEC C and C++ side-effects 
package increasingly results in these programs exhibit- 
ing unexpected behavior when the compiler uses 
Standard C aliasing rules. 

Passing a pointer to one type to a routine that 
expects a pointer to another type works as expected, 
until the GEM optimizer inlines the called procedure. 
If the procedure is not inlined, the DEC C and C++ 
side-effects package must assume that the call conflicts 
with all pointer accesses before and after the call. Once 
GEM inlines the routine, the side-effects package is 
free to assume that references using the inlined pointer 
do not conflict with references using the pointer at the 
call site. The two pointers point to two different types.



A recent example of this problem occurred in the
gcc program in the SPECint95 benchmark suite. All
programs in this suite are supposed to conform to the
Standard C type-aliasing rules. Because of an improvement
to the GEM optimizer, this benchmark started
to give unexpected results. In rtx_alloc, gcc clears a
structure by treating it as an array of ints, assigning
zero to each element of the array. Subsequent to zeroing
this structure, gcc assigns a value to one of the
fields in the structure. Through a series of valid
optimizations (given the incorrect type information), the
resulting code did not clear all the fields in the
structure. This left uninitialized data in the structure,
resulting in gcc behaving in an unexpected manner.
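
A conforming way to obtain the same effect is sketched below. The structure and function names are hypothetical and stand in for the gcc declarations; the point of the sketch is only that clearing the object with memset, which accesses it as bytes, stays within the Standard C aliasing rules.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical structure standing in for the one gcc allocates. */
struct node {
    short code;
    short mode;
    int   length;
};

/* Nonconforming pattern described in the text, shown only as a comment:
 *
 *     int *p = (int *) n;
 *     for (i = 0; i < sizeof *n / sizeof (int); i++)
 *         p[i] = 0;
 *
 * The stores are to ints, so a type-based analysis may treat them as
 * unrelated to a later store to the short field n->code.
 */

/* Conforming alternative: memset writes the object as bytes, and
   character accesses may alias objects of any type. */
void clear_node(struct node *n)
{
    assert(n != NULL);
    memset(n, 0, sizeof *n);
}

/* Zero the node, then assign one field, as rtx_alloc is described as doing. */
void init_node(struct node *n, short code)
{
    clear_node(n);
    n->code = code;
}
```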

To avoid potential problems, the DEC C compiler, 
by default, does not use the Standard C type rules 
when performing alias analysis. The user of the com- 
piler has to explicitly assert that the program does fol- 
low the Standard C type rules through the use of a 
command-line switch. 

The DIGITAL C++ compiler does assume that the
C++ program it is compiling adheres to the Standard 
C++ type rules. A user of the DIGITAL C++ compiler 
can use a command-line switch to inform the compiler 
that it should use traditional C type rules when per- 
forming alias analysis. 

Summary 

Using Standard C type information during alias analysis 
does improve the generated code for some C and C++ 
programs. The compilation cost of using type informa- 
tion is small. Except for rare cases, performance gains 
resulting from these code improvements are small. Any 
programs compiled using type information during alias
analysis must strictly adhere to the Standard C and C++
aliasing rules. If not, the optimizer may generate code
that produces unexpected results.

Compiler Optimization for Superscalar Systems:
Global Instruction Scheduling without Copies



The performance of instruction-level parallel 
systems can be improved by compiler programs 
that order machine operations to increase 
system parallelism and reduce execution time. 
The optimization, called instruction scheduling, 
is typically classified as local scheduling if only 
basic-block context is considered, or as global 
scheduling if a larger context is used. Global 
scheduling is generally thought to give better 
results. One global method, dominator-path 
scheduling, schedules paths in a function's 
dominator tree. Unlike many other global 
scheduling methods, dominator-path schedul- 
ing does not require copying of operations 
to preserve program semantics, making this 
method attractive for superscalar architectures 
that provide a limited amount of instruction- 
level parallelism. In a small test suite for the 
Alpha 21164 superscalar architecture, dominator-path
scheduling produced schedules requiring
7.3 percent less execution time than those pro- 
duced by local scheduling alone. 



Many of today's computer applications require computation
power not easily achieved by computer architectures
that provide little or no parallelism. A promising
alternative is the parallel architecture, more specifically,
the instruction-level parallel (ILP) architecture, which
increases computation during each machine cycle. ILP
computers allow parallel computation of the lowest
level machine operations within a single instruction
cycle, including such operations as memory loads and
stores, integer additions, and floating-point multiplications.
ILP architectures, like conventional architectures,
contain multiple functional units and pipelined functional
units, but they have a single program counter
and operate on a single instruction stream. Compaq
Computer Corporation's AlphaServer system, based on
the Alpha 21164 microprocessor, is an example of an
ILP machine.

To effectively use parallel hardware and obtain
performance advantages, compiler programs must
identify the appropriate level of parallelism. For ILP
architectures, the compiler must order the single
instruction stream such that multiple, low-level operations
execute simultaneously whenever possible. This
ordering by the compiler of machine operations to
effectively use an ILP architecture's increased parallelism
is called instruction scheduling. It is an optimization
not usually found in compilers for non-ILP
architectures.

Instruction scheduling is classified as local if it 
considers code only within a basic block and global if 
it schedules code across multiple basic blocks. A dis- 
advantage to local instruction scheduling is its inability 
to consider context from surrounding blocks. While 
local scheduling can find parallelism within a basic 
block, it can do nothing to exploit parallelism between 
basic blocks. Generally, global scheduling is preferred 
because it can take advantage of added program paral- 
lelism available when the compiler is allowed to move 
code across basic block boundaries. Tjaden and Flynn, 1 
for example, found parallelism within a basic block 
quite limited. Using a test suite of scientific programs, 
they measured an average parallelism of 1.8 within 
basic blocks. In similar experiments on scientific programs
in which the compiler moved code across basic
block boundaries, Nicolau and Fisher 2 found parallelism
that ranged from 4 to a virtually unlimited number,
with an average of 90 for the entire test suite.

Trace scheduling is a global scheduling technique 
that attempts to optimize frequently executed paths of 
a program, possibly at the expense of less frequently 
executed paths. Trace scheduling exploits parallelism 
within sequential code by allowing massive migration of 
operations across basic block boundaries during scheduling.
By addressing this larger scheduling context (many
basic blocks), trace scheduling can produce better schedules
than techniques that address the smaller context of a
single block. To ensure the program semantics are not
changed by interblock motion, trace scheduling inserts 
copies of operations that move across block boundaries. 
Such copies, necessary to ensure program semantics, are 
called compensation copies. 

The research described here is driven by a desire to 
develop a global instruction scheduling technique 
that, like trace scheduling, allows operations to cross 
block boundaries to find good schedules and that, 
unlike trace scheduling, does not require insertion of 
compensation copies. Like trace scheduling, dominator-path
scheduling (DPS) first defines a multiblock context for
scheduling and then uses a local instruction scheduler to
treat the larger context like a single basic block. Such a
technique provides effective schedules and avoids the
performance cost of executing compensation copies. The
global scheduling technique described here is based on the
dominator relation* among the basic blocks of a function.

*A basic block, D, dominates another block, B, if every path
from the root of the control-flow graph for a function to B
must pass through D.

Local Instruction Scheduling

Since DPS relies on a local instruction scheduler, we
begin with a brief discussion of the local scheduling
problem. As the name implies, local instruction scheduling
attempts to maximize parallelism within each
basic block of a function's control flow graph. In general,
this optimization problem is NP-complete.
However, in practice, heuristics achieve good results.
(Landskov et al. give a good survey of early instruction
scheduling algorithms. Allan et al. 7 describe how one
might build a retargetable local instruction scheduler.)

List scheduling is a general method often used for
local instruction scheduling. Briefly, list scheduling
typically requires two phases. The first phase builds
a directed acyclic graph (DAG), called the data dependence
DAG (DDD), for each basic block in the
function. DDD nodes represent operations to be
scheduled. A directed edge in the DDD from a node X
to a node Y constrains X to occur no later than Y.
These DDD edges are based on the formalism of data
dependence analysis. There are three basic types of
data dependence, as described by Padua et al.

■ Flow dependence, also called true dependence or
data dependence. A DDD node M2 is flow dependent
on DDD node M1 if M1 executes before M2 and
M1 writes to some memory location read by M2.

■ Antidependence, also called false dependence. A
DDD node M2 is antidependent on DDD node M1
if M1 executes before M2 and M2 writes to a memory
location read by M1, thereby destroying the
value needed by M1.

■ Output dependence. A DDD node M2 is output
dependent on DDD node M1 if M1 executes before
M2 and M2 and M1 both write to the same location.

To facilitate determination and manipulation of
data dependence, the compiler maintains, for each
DDD node, a set of all memory locations used (read)
and all memory locations defined (written) by that
particular DDD node.
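
Given such sets, the three kinds of dependence reduce to set intersections. The following C sketch illustrates this under the assumption that memory locations are numbered 0 to 31 and each set is a bit mask; the type and function names are ours, chosen for the illustration, and do not come from the compiler described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t loc_set;        /* hypothetical: one bit per memory location */

struct ddd_node {
    loc_set use;                 /* locations read by the node */
    loc_set def;                 /* locations written by the node */
};

enum { DEP_FLOW = 1, DEP_ANTI = 2, DEP_OUTPUT = 4 };

/* Returns the kinds of dependence of 'later' on 'earlier', where
   'earlier' executes first. A result of 0 means no DDD edge is needed. */
unsigned dependences(const struct ddd_node *earlier,
                     const struct ddd_node *later)
{
    unsigned kinds = 0;
    assert(earlier != NULL && later != NULL);
    if (earlier->def & later->use) kinds |= DEP_FLOW;    /* write, then read  */
    if (earlier->use & later->def) kinds |= DEP_ANTI;    /* read, then write  */
    if (earlier->def & later->def) kinds |= DEP_OUTPUT;  /* write, then write */
    return kinds;
}
```

Any nonzero result requires a DDD edge from the earlier node to the later one.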

Once the DDD is constructed, the second phase
begins when list scheduling orders the graph's nodes
into the shortest sequence of instructions, subject to
(1) the constraints in the graph, and (2) the resource
limitations in the machine (i.e., a machine is typically
limited to holding only a single value at any time). In
general list scheduling, an ordered list of tasks, called a
priority list, is constructed. The priority list takes its
name from the fact that tasks are ranked such that those
with the highest priority are chosen first. In the context
of local instruction scheduling, the priority list contains
DDD nodes, all of whose predecessors have already
been included in the schedule being constructed.
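
A minimal sketch of the second phase follows. It rests on several simplifying assumptions that are ours rather than the paper's: the machine issues one operation per cycle, a node becomes ready once every predecessor has been issued and its latency has elapsed, and priority is the latency-weighted height of a node in the DDD. The structure layout and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_NODES 16

struct ddd {
    int n;                            /* number of nodes */
    int latency[MAX_NODES];           /* cycles until a node's result is available */
    int edge[MAX_NODES][MAX_NODES];   /* edge[x][y] != 0: x must precede y */
};

/* Priority: longest latency-weighted path from v to a leaf (DDD is acyclic). */
static int height(const struct ddd *g, int v)
{
    int best = 0;
    for (int s = 0; s < g->n; s++)
        if (g->edge[v][s]) {
            int h = height(g, s);
            if (h > best) best = h;
        }
    return g->latency[v] + best;
}

/* Stores the issue cycle of each node in start[]; returns the cycle in
   which the last result becomes available. */
int list_schedule(const struct ddd *g, int start[])
{
    int done = 0, cycle = 0, finish = 0;
    assert(g != NULL && g->n >= 0 && g->n <= MAX_NODES);
    for (int v = 0; v < g->n; v++) start[v] = -1;
    while (done < g->n) {
        int pick = -1, pick_h = -1;
        for (int v = 0; v < g->n; v++) {
            int ready = (start[v] < 0);
            for (int p = 0; ready && p < g->n; p++)
                if (g->edge[p][v] &&
                    (start[p] < 0 || start[p] + g->latency[p] > cycle))
                    ready = 0;                 /* a predecessor is not finished */
            if (ready) {
                int h = height(g, v);
                if (h > pick_h) { pick = v; pick_h = h; }
            }
        }
        if (pick >= 0) {                       /* issue the best ready node */
            start[pick] = cycle;
            done++;
            if (cycle + g->latency[pick] > finish)
                finish = cycle + g->latency[pick];
        }
        cycle++;                               /* otherwise the cycle is a stall */
    }
    return finish;
}
```

With two 2-cycle loads feeding a 1-cycle add and one independent 1-cycle operation, the scheduler places the independent operation in the cycle where the add would otherwise stall.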

Expressions, Statements, and Operations 

Within the context of this paper, we discuss algorithms 
for code motion. Before going further, we need to 
ensure common understanding among our readers for 
our use of terms such as expressions, statements, and 
operations. To start, we consider a computer program 
to be a list of operations, each of which (possibly) 
computes a right-hand side (rhs) value and assigns the 
rhs value to a memory location represented by a
left-hand side (lhs) variable. This can be expressed as

A <- E

where A represents a single memory location and E 
represents an expression with one or more operators 
and an appropriate number of operands. During dif- 
ferent phases of a compiler, operations might be repre- 
sented as 

■ Source code, a high-level language such as C

■ Intermediate statements, a linear form of three-address
code such as quads or n-tuples

■ DDD nodes, nodes in a DDD, ready to be scheduled
by the instruction scheduler

Important to note about operations, whether repre- 
sented as intermediate statements, source code, or 
DDD nodes, is that operations include both a set of 
definitions and a set of uses. 

Expressions, in contrast, represent the rhs of an
operation and, as such, include uses but not definitions.
Throughout this paper, we use the terms statement,
intermediate statement, operation, and DDD
node interchangeably, because they all represent an
operation, with both uses and definitions, albeit generally
at different stages of the compilation process.
When we use the term expression, however, we mean
an rhs with uses only and no definition.

Dominator Analysis Used in Code Motion 

In order to determine which operations can move 
across basic block boundaries, we need to analyze the 
source program. Although there are some choices 
as to the exact analysis to perform, dominator-path 
scheduling is based upon a formalism first described by 
Reif and Tarjan. We summarize Reif and Tarjan's
work here and then discuss the enhancements needed 
to allow interblock movement of operations. 

In their 1981 paper, Reif and Tarjan provide a fast
algorithm for determining the approximate birthpoints
of expressions in a program's flow graph. An expression's
birthpoint is the first block in the control flow
graph at which the expression can be computed, and
the value computed is guaranteed to be the same as in 
the original program. Their technique is based upon 
fast computation of the idef set for each basic block of 
the control flow graph. The idef set for a block B is 
that set of variables defined on a path between B's 
immediate dominator and B. Given that the domina- 
tor relation for the basic blocks of a function can be 
represented as a dominator tree, the immediate domi- 
nator, IDOM, of a basic block B is B's parent in the 
dominator tree. 
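
To make the dominator tree concrete, the sketch below computes immediate dominators with the standard iterative data-flow formulation over bit sets. It is not Reif and Tarjan's algorithm, which is considerably faster, and it assumes at most 32 blocks, block 0 as the entry, and every block reachable from the entry. The function name and argument layout are ours.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_BLOCKS 32

/* pred[b] is a bit mask of the predecessors of block b; block 0 is the
   entry. On return, idom[b] is b's parent in the dominator tree, and
   idom[0] is -1. */
void immediate_dominators(int n, const uint32_t pred[], int idom[])
{
    uint32_t dom[MAX_BLOCKS];
    uint32_t all;
    int changed = 1;

    assert(n >= 1 && n <= MAX_BLOCKS);
    all = (n == 32) ? 0xFFFFFFFFu : ((1u << n) - 1u);

    dom[0] = 1u;                             /* the entry dominates only itself */
    for (int b = 1; b < n; b++) dom[b] = all;

    while (changed) {                        /* iterate to a fixed point */
        changed = 0;
        for (int b = 1; b < n; b++) {
            uint32_t d = all;
            for (int p = 0; p < n; p++)
                if (pred[b] & (1u << p)) d &= dom[p];
            d |= 1u << b;
            if (d != dom[b]) { dom[b] = d; changed = 1; }
        }
    }

    idom[0] = -1;
    for (int b = 1; b < n; b++) {
        /* The strict dominators of b form a chain; the immediate dominator
           is the one whose own dominator set equals that chain. */
        uint32_t strict = dom[b] & ~(1u << b);
        idom[b] = -1;
        for (int d = 0; d < n; d++)
            if ((strict & (1u << d)) && dom[d] == strict) idom[b] = d;
    }
}
```

For a diamond-shaped graph in which block 1 branches to blocks 2 and 3, which rejoin at block 4, the immediate dominator of block 4 is block 1, not either arm of the diamond.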

Expression birthpoints are not sufficient to allow us
to safely move entire operations from a block to one of
its dominators because birthpoints address only the
movement of expressions, not definitions. Operations
in general include not only a computation of some
expression but the assignment of the value computed
to a program variable. Ensuring a "safe" motion for an
expression requires only that no expression operand
move above any possible definition of that operand,
since such a move would change the program semantics.
A similar requirement is necessary, but not sufficient,
for the variable, A, to which the value is being assigned.
In addition to not moving A above any previous
definition of A, A cannot move above any possible use of A.
Otherwise, we run the risk of changing A's value for
that previous use. Thus, dominator analysis computes
the iuse set for each basic block as well as the idef set.
The iuse set for a block, B, is that set of variables used
on some path between B's immediate dominator and
B. Using the idef and iuse sets, dominator analysis
computes an approximate birthpoint for each operation.

In this paper, we use the term dominator analysis 
to mean the analysis necessary to allow code motion of 
operations while disallowing compensation copies. 
Additionally, we use the term dominator motion for 
the general optimization of code motion based upon 
dominator analysis. 

Enhancing the Reif and Tarjan Algorithm 

By enhancing Reif and Tarjan's algorithm to compute 
birthpoints of operations instead of expressions, we 
make several issues important that previously had no 
effect upon Reif and Tarjan's algorithm. This section
motivates and describes the information needed to
allow dominator motion, including the use, def, iuse,
and idef sets for each basic block. An algorithmic
description of this dominator analysis information is 
included in the section Overview of Dominator-Path 
Scheduling and the Algorithm for Interblock Motion. 

When we allow code motion to move intermediate 
statements (or just expressions) from a block to one of 
its dominators, we run the risk that the statement 
(expression) will be executed a different number of 
times in the dominator block than it would have been 
in its original location. When we move only expres- 
sions, the risk is acceptable (although it may not be 
efficient to move a statement into a loop) since the 
value needed at the original point of computation is 
preserved. Relative to program semantics, the number 
of times the same value is computed has no effect as 
long as the correct value is computed the last time. 
This accuracy is guaranteed by expression birthpoints. 

Consider also the consequences of moving an expression
from a block that is never executed for some particular
input data. Again, it may not be efficient to compute
a value never used, but the computation does not alter
program semantics. When dominator motion moves
entire statements, however, the issue becomes more 
complex. If the statement moved assigns a new value to 
an induction variable, as in the following example, 

n = n + 1 

dominator motion would change n's final value if it 
moved the statement to a block where the execution 
frequency differed from that of its original block. We 
could alleviate this problem by prohibiting motion of
any statement for which the use and def sets are not
disjoint, but the possibility remains that a statement
may define a variable based indirectly upon that
variable's previous value. To remedy the more general
problem, we disallow motion of any statement, S,
whose def set intersects with those variables that are
used-before-defined in the basic block in which S resides.

Suppose the optimizer moves an intermediate state- 
ment that defines a global variable from a block that 
may never be executed for some set of input data into 
a dominator block that is executed at least once for 
the same input data. Then the optimized version has 
defined a variable that the unoptimized function did 
not, possibly changing program semantics. We can be 
sure that such motion does not change the semantics 
of that function being compiled; but there is no mech- 
anism, short of compiling the entire program as a sin- 
gle unit, to ensure that defining a global variable in this 
function will not change the value used in another 
function. Thus, to be conservative and ensure that 
it does not change program semantics, dominator 
motion prohibits interblock movement of any state- 
ment that defines a global variable. At first glance, it 
may seem that this prohibition cripples dominator 
motion's ability to move any intermediate statements 
at all; but we shall see that such is not the case. 
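
The two restrictions just described can be sketched as a simple filter applied before any other legality test. The sketch assumes variables are numbered 0 to 31 and represented as bit masks, and that the set of global variables is known; the names are ours. It covers only these two restrictions, not the idef and iuse checks that dominator analysis also requires.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t var_set;            /* hypothetical: one bit per variable */

struct statement {
    var_set use;                     /* variables read */
    var_set def;                     /* variables written */
};

/* Variables that some statement in the block reads before any statement
   in the block has written them. */
var_set used_before_defined(const struct statement *stmts, size_t count)
{
    var_set defined = 0, ubd = 0;
    assert(stmts != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        ubd |= stmts[i].use & ~defined;
        defined |= stmts[i].def;
    }
    return ubd;
}

/* Returns 0 if either restriction forbids moving s out of its block. */
int may_leave_block(const struct statement *s, var_set ubd, var_set globals)
{
    assert(s != NULL);
    if (s->def & ubd)     return 0;  /* depends on its own earlier value */
    if (s->def & globals) return 0;  /* defines a global variable */
    return 1;
}
```

In a block containing m = n followed by n = m + 1, the second statement is rejected even though its own use and def sets are disjoint, because n is used before it is defined in the block. This is the indirect case the simpler disjointness test would miss.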

One final addition to Reif and Tarjan information is 
required to take care of a subtle problem. As discussed 
above, dominator analysis uses the idef and iuse sets to 
prevent illegal code motion. The use of these sets was 
assumed to be sufficient to ensure the legality of code 
motion into a dominator block; unfortunately, this is 
not the case. The problem is that a definition might 
pass through the immediate dominator of B to reach 
a use in a sibling of B in the dominator tree. If there 
were a definition of this variable in B, but the variable 
was not defined on any path from the immediate dom- 
inator, there would be nothing in dominator analysis 
to prevent the definition from being moved into the 
dominator. But that would change the program's 
semantics. Figure 1 shows the control-flow graph for a 
function called findmax(), with only the statements 
referring to register r7. Register r7 is defined in blocks 
B3 and B7, and referenced in B9. This means that r7 
is live-out of B5 and live-in to B8, but not live-in to 
B7; there is a definition of r7 in B3 that reaches B8. 
Because there is no definition or use between B7 and 
its immediate dominator B5, the idef and iuse sets of 
B7 are empty; thus, dominator analysis, as described 
above, would allow the assignment of r7 to move 
upward to block B5. This motion is illegal; it changes 
the definition in B3. Moving the operation from B7 to 
B5 changes the conditional assignment of r7 to an 
unconditional one. 

Figure 1

Control Flow Graph for the Function findmax()

To prevent this from happening, we can insert the
variable into the iuse set of the block, B, in which we
wish the statement to remain. We do not, however,
want to add to the iuse set unnecessarily. The solution
is to add each variable, V, that is live-in to any of B's
siblings in the dominator tree, but not live-in to B, to B's
iuse set. This will prevent any definition of V that
might exist in B from moving up. If there is a definition
of V in B, but V is live-in to B, there must be some
use of V in B before the definition, so it could not move
upward in any case.
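
This adjustment can be sketched as follows, assuming that immediate dominators and live-in sets have already been computed and that variables are represented as bits in a 32-bit mask. The function name and array layout are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t var_set;            /* hypothetical: one bit per variable */

/* For every block b, add to iuse[b] each variable that is live-in to a
   sibling of b in the dominator tree but not live-in to b itself.
   idom[b] is b's dominator-tree parent, or -1 for the root. */
void add_sibling_live_ins(int n, const int idom[],
                          const var_set live_in[], var_set iuse[])
{
    assert(n >= 0 && idom != NULL && live_in != NULL && iuse != NULL);
    for (int b = 0; b < n; b++) {
        if (idom[b] < 0) continue;                  /* the root has no siblings */
        for (int s = 0; s < n; s++)
            if (s != b && idom[s] == idom[b])
                iuse[b] |= live_in[s] & ~live_in[b];
    }
}
```

Applied to the situation described for findmax(), where B7 and B8 are both children of B5 and r7 is live-in to B8 but not to B7, the routine places r7 in the iuse set of B7, which keeps the assignment to r7 from moving up into B5.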

Measurement of Dominator Motion 

To measure the motion possible in C programs,
Sweany 12 defined dominator motion as the movement
of each intermediate statement to its birthpoint, as
defined by dominator analysis, and measured it by the
number of dominator blocks each statement jumps during
such movement. Sweany's choice of intermediate statements
(as contrasted with source code, assembly language,
or DDD nodes) is attributed to the lack of
machine resource constraints at that level of program
abstraction. He envisioned dominator motion as an
upper bound on the motion available in C programs
when compensation copies are included. In the test
suite of 12 C programs compiled, more than 25 percent
of all intermediate statements moved at least one
dominator block upwards toward the root of the dominator
tree. One function allowed more than 50 percent
of the statements to be hoisted an average of
nearly eight dominator blocks. The considerable
amount of motion (without copies) available at the
intermediate statement level of program abstraction
provided us with the motivation to use similar analysis
techniques to facilitate global instruction scheduling.

Overview of Dominator-path Scheduling and the 
Algorithm for Interblock Motion 

Since experiments show that dominator analysis allows 
considerable code motion without copies, we chose to 
use dominator analysis as the basis for the instruction 
scheduling algorithm described here, namely dominator- 
path scheduling. As noted above, DPS is a global 
instruction scheduling method that does not require 
copies of operations that move from one basic block to 
another. DPS performs global instruction scheduling by 
treating a group of basic blocks found on a dominator 
tree path as a single block, scheduling the group as a 
whole. In this regard, it resembles trace scheduling, 
which schedules adjacent basic blocks as a single block. 
DPS's foundation is scheduling instructions while mov- 
ing operations among blocks according to both the 
opportunities provided by and the restrictions imposed 
by dominator analysis. 

The question arises as to how to exploit dominator 
analysis information to permit code motion at the 
instruction level during scheduling. DPS is based on 
the observation that we can use idef and iuse sets to 
allow operations to move from a block to one of its 
dominators during instruction scheduling. Instruction 
scheduling can then choose the most advantageous 
position for an operation that is placed in any one of 
several blocks. Because machine operations are incor- 
porated in nodes of the DDD used in scheduling and, 
like intermediate statements, DDD nodes are represented
by def and use sets, the same analysis performed
on intermediate statements can also be applied to a 
basic block's DDD nodes. 

The same motivation that drives trace scheduling,
namely that scheduling one large block allows better use
of machine resources than scheduling the same code as
several smaller blocks, also applies to DPS. In contrast
to trace scheduling, DPS does not allow motion of 
DDD nodes when a copy of a node is required and does 
not incur the code explosion due to copying that trace 
scheduling can potentially produce. For architectures 
with moderate instruction-level parallelism, DPS may 
produce better results than trace scheduling, because 
the more limited motion may be sufficient to make 
good use of machine resources, and unlike trace sched- 
uling, no machine resources are devoted to executing 
semantic-preserving operation copies. 

Much like traces,* the dominator path's blocks can
be chosen by any of several methods. One method is a
heuristic choice of a path based on length, nesting
depth, or some other program characteristic. Another
is programmer specification of the most important
paths. A third is actual profiling of the running
program. We visit this issue again in the section Choosing
Dominator Paths. First, however, we need to discuss
the algorithmic details of DPS.

*groups of blocks to be scheduled together in trace scheduling

Once DPS selects a dominator path to schedule, it 
requires a method to combine the blocks' DDDs into 
a single DDD for the entire dominator path. In our 
compiler, this task is performed by a DDD coupler,' 5 
which is designed for the purpose. Given the DDD 
coupler, DPS proceeds by repeatedly 

■ Choosing a dominator path to schedule 

■ Using the DDD coupler to combine each block's 
DDD on the chosen dominator path 

■ Scheduling the combined DDD as a single block 

The dominator-path scheduling algorithm, detailed 
in this section, is summarized in Figures 2 and 3. 

A significant aspect of the DPS process is to ensure 
"appropriate" interblock motion of DDD nodes and 
to prohibit "illegal" motion. As noted earlier, the 
combined DDD for a dominator path includes control 
flow. Therefore, when DPS schedules a group of 
blocks represented by a single DDD, it needs a mecha- 
nism to map correctly the scheduled instructions to 
the basic blocks. The mechanism is easily accom- 
plished by the addition of two special nodes to each 
block's DDD. Called BlockStart and BlockEnd, these 
special nodes represent the basic block boundaries. 
Since dominator-path scheduling does not allow 
branches to move across block boundaries, each 
BlockStart and BlockEnd node is initially "tied" (with 
DDD arcs) to the branch statement of the block, if any. 
Because BlockStart and BlockEnd are nodes in the 
eventually combined DDD, they are scheduled like all
other nodes of the combined DDD. After scheduling, 
all instructions between the instruction containing the 
BlockStart node for a block and the instruction con- 
taining the BlockEnd node for that block are consid- 
ered instructions for that block. Next, DPS must 
ensure that the BlockStart and BlockEnd DDD nodes 
remain ordered (in the scheduled instructions) relative 
to one another and to the BlockStart and BlockEnd 
nodes for any other block. To do so, DPS adds use and 
def information to the nodes to represent a
pseudoresource, BlockBoundary. Because each BlockStart
node defines BlockBoundary and each BlockEnd 
node uses BlockBoundary, no BlockEnd node can be 
scheduled ahead of its associated BlockStart node 
(because of flow dependence). Also, a BlockStart node
cannot be scheduled before its dominator block's 
BlockEnd node (because of antidependence). By 
establishing these imaginary dependencies, DPS 
ensures that the DDD coupler adds arcs between all
BlockStart and BlockEnd nodes. 
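
The ordering effect of the BlockBoundary pseudoresource can be illustrated with a small Python sketch. It is not the compiler's DDD coupler: the `Node` class and the `dependence_arcs` helper are hypothetical names introduced here, and the arc computation is a deliberately conservative all-pairs check that also reports transitive arcs. It assumes only what the text states, namely that each BlockStart defines BlockBoundary and each BlockEnd uses it.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A DDD node reduced to a name plus its def and use sets (illustrative)."""
    name: str
    defs: set = field(default_factory=set)
    uses: set = field(default_factory=set)


def dependence_arcs(nodes):
    """Return (earlier, later, kind) arcs for nodes given in program order."""
    arcs = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            if a.defs & b.uses:
                arcs.append((a.name, b.name, "flow"))    # def then use
            if a.uses & b.defs:
                arcs.append((a.name, b.name, "anti"))    # use then def
            if a.defs & b.defs:
                arcs.append((a.name, b.name, "output"))  # def then def
    return arcs


# Boundary nodes of two blocks on a dominator path, in their original order.
path = [
    Node("Start1", defs={"BlockBoundary"}),
    Node("End1", uses={"BlockBoundary"}),
    Node("Start2", defs={"BlockBoundary"}),
    Node("End2", uses={"BlockBoundary"}),
]
arcs = dependence_arcs(path)
```

In this sketch the flow arc from `Start1` to `End1` keeps a block's end after its start, and the anti arc from `End1` to `Start2` keeps the dominated block's start after its dominator's end, which are the two orderings the text describes.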






Algorithm Dominator-Path Scheduling
Input:
    Function Control Flow Graph
    Dominator Tree
    Post-Dominator Tree
Output:
    Scheduled instructions for the function
Algorithm:
    While at least one Basic Block is unscheduled
        Heuristically choose a path B1, B2, ..., Bn in the Dominator Tree
            that includes only unscheduled Basic Blocks.
        Perform dominator analysis to compute IDef and IUse sets
        /* Build one DDD for the entire dominator path */
        For i = 2 to n
            T = InitializeTransitionDDD(B(i-1), Bi)
            CombinedDDD = Couple(CombinedDDD, T)
            CombinedDDD = Couple(CombinedDDD, Bi)
        Perform list scheduling on CombinedDDD
        Mark each block of DP scheduled
        Copy scheduled instructions to the Blocks of the path (instructions
            between the BlockStart and BlockEnd nodes for a Block are
            "written" to that Block)

Figure 2
Dominator-path Scheduling Algorithm



Looking back to dominator analysis, we see that 
interblock motion is prohibited if the operation being 
moved 

■ Defines something that is included in either the
idef or iuse set

■ Uses something included in the idef set for the
block in which the operation currently resides

To obtain the same prohibitions in the combined
DDD, we add the idef set for a basic block, B, to the
def set of B's BlockStart node. Similarly, we add the iuse
set for B to the use set of B's BlockStart node. Thus we
enforce the same restriction on movement that domi- 
nator analysis imposed upon intermediate statements 
and ensure that any interblock motion preserves pro- 
gram semantics. In a similar manner, DPS includes the 
restrictions on movement of operations that define 
either global variables or induction variables. Figure 3 
gives an algorithmic description of the process of 
"doping" the BlockStart and BlockEnd nodes to pre- 
vent disallowed code motion. 
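
The two prohibitions listed above amount to a pair of set-intersection tests. The following Python sketch states them directly; the function name and parameters are hypothetical, and the sketch covers only the idef and iuse rules, not the additional restrictions on global and induction variables.

```python
def may_move_to_dominator(op_defs, op_uses, idef, iuse):
    """Return True if an operation may leave its block for the dominator.

    op_defs / op_uses: what the operation defines and uses.
    idef / iuse: the sets computed for the block the operation resides in.
    """
    if op_defs & (idef | iuse):
        return False  # defines something in the idef or iuse set
    if op_uses & idef:
        return False  # uses something in the idef set
    return True
```

Adding idef to the def set and iuse to the use set of the BlockStart node has the same effect inside the scheduler: an operation that fails either test acquires a dependence on BlockStart and so cannot be placed above it.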

DPS is complicated by factors not relevant for dom- 
inator motion of intermediate statements. Foremost is 
the complexity imposed by the bidirectional motion of



operations that instruction scheduling allows. In dom- 
inator motion, intermediate statements move in only 
one direction, i.e., toward the top of the function's 
control flow graph, not from a dominator block to a 
dominated one. This one-directional motion is rea- 
sonable when attempting to move intermediate state- 
ments because one statement's movement will likely 
open possibilities for more motion in the same direc- 
tion by other statements. When statements move in 
different directions, one statement's motion might 
inhibit another's movement in the opposite direction. 
The goal of dominator motion is to move statements as 
far as possible in the control flow graph. In contrast, the 
goal of DPS is not to maximize code motion, but rather
to find, for each operation, O, that location for O that
will yield the shortest schedule. Thus our goal has 
changed from that of dominator motion. To gain the 
full benefit from DPS, we wish to allow operations to 
move past block boundaries in either direction. To per- 
mit bidirectional motion, we use the post-dominator 
relation, which says that a basic block, PD, is a post- 
dominator of a basic block B if all paths from B to the 
function's exit must pass through PD. Using this
strategy, we similarly define post-idef and post-iuse sets. In
Algorithm InitializeTransitionDDD(B1, B2)
Input:
    A Transition DDD template, with a Dummy DDDNode
        for B1's block end and one for B2's block start
    Two basic blocks, B1 and B2, that we wish to couple
    Dominator Tree
    Post-Dominator Tree
    The following dataflow information:
        Def, Use, IDef, and IUse sets for B1 and B2
        Used-Before-Defined set for B1
        Post-IDef and Post-IUse sets for B1 and B2
        B2's "sibling" set, defined to include any variable
            live-in to a dominator-tree sibling of B2, but not
            live-in to B2
    A basic block DDD for each of B1 and B2
Output:
    An initialized Transition DDD, T
Algorithm:
    T = TransitionDDD
    /* "Fix" set for global and induction variables. */
    Add set of global variables to B2's IUse
    Add B1's Used-Before-Defined to B2's IUse
    Add B2's sibling set to B2's IUse
    If B2 does not post-dominate B1
        Add B1's Use set to T's BlockEnd Def set
        Add B1's Def set to T's BlockEnd Use set
    Else
        Add B1's Post-IDef set to T's BlockEnd Def set
        Add B1's Post-IUse set to T's BlockEnd Use set
    Add B2's IDef set to T's BlockStart Def set
    Add B2's IUse set to T's BlockStart Use set
    Return T

Figure 3
Initialize Transition DDD Algorithm



fact, it is not difficult to compute all these quantities
for a function. The simplest way is to logically reverse
the direction of all the control flow graph arcs and
perform dominator analysis on the resulting graph.
Having computed the post-dominator tree, DPS
chooses dominator paths such that the dominated
node is a post-dominator of its immediate predecessor
in a dominator path. This choice allows operations to
move "freely" in both directions. Of course, this may
be too limiting on the choice of dominator paths. To
allow for the possibility that nodes in a dominator path
will not form a post-dominator relation, DPS needs a
mechanism to limit bidirectional motion when
needed. Again, we rely on the technique of adding
dependencies to the combined DDD. In this case
(assuming that DPS is scheduling paths in the forward
dominator tree), for any basic block, B, whose
successor, S, in the forward dominator path does not
post-dominate B, DPS adds B's def set to the use set of the
BlockEnd node associated with B. In similar fashion,
we add B's use set to B's BlockEnd node's def set.
This technique prevents any DDD node originally in
B from moving downward in the dominator path.

Choosing Dominator Paths

DPS allows code movement along any dominator
path, but there are many ways to select these paths. An
investigation of the effects of dominator-path choice
on the efficiency of generated schedules tells us that
the choice of path is too important to be left to
arbitrary selection; twice the average percent speedup* for
several functions can often be achieved with a simple,
well-chosen heuristic. Some functions have a potential
percent speedup almost four times the average. Thus,
it is important to find a good, generally applicable
heuristic to select the dominator paths.

*(unoptimized_speed - optimized_speed) / unoptimized_speed
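
Path selection consults both the dominator and the post-dominator relation. As noted earlier, post-dominators can be obtained by reversing every control flow graph arc and running ordinary dominator analysis. The Python sketch below illustrates that idea with a simple iterative set-based analysis; it is an assumption-laden illustration (hypothetical function names, a graph given as a successor dictionary with every node present as a key, all nodes reachable), not the analysis used in the compiler described here.

```python
def dominators(succ, entry):
    """Iterative analysis: dom[n] = {n} plus the intersection over all preds."""
    nodes = set(succ) | {s for targets in succ.values() for s in targets}
    pred = {n: set() for n in nodes}
    for n, targets in succ.items():
        for s in targets:
            pred[s].add(n)
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            incoming = [dom[p] for p in pred[n]]
            new = {n} | (set.intersection(*incoming) if incoming else set())
            if new != dom[n]:
                dom[n] = new
                changed = True
    return dom


def post_dominators(succ, exit_node):
    """Reverse every arc, then run dominator analysis from the exit node."""
    nodes = set(succ) | {s for targets in succ.values() for s in targets}
    reversed_succ = {n: [] for n in nodes}
    for n, targets in succ.items():
        for s in targets:
            reversed_succ[s].append(n)
    return dominators(reversed_succ, exit_node)


# A diamond: A branches to B and C, which rejoin at D.
cfg = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
dom = dominators(cfg, "A")
pdom = post_dominators(cfg, "D")
```

For this diamond, D post-dominates A, so a path A, D permits motion in both directions, whereas B does not post-dominate A and a path A, B would need the extra BlockEnd dependencies.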

Unfortunately, it is not practical to schedule all of 
the possible partitionings for large functions. If we 
allow a basic block to be included in only one
dominator path, the formula for the number of distinct
partitionings of the dominator tree is

$$\prod_{n \in N} \left[\mathrm{outdeg}(n) + 1\right]$$

where $N$ is the set of nodes of the dominator tree.
Although the number of possible paths is not prohibi- 
tive for small dominator trees, larger trees have a pro- 
hibitively large number. For example, whetstone's 
main(), with 49 basic blocks, has almost two trillion 
distinct partitionings. 
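
The count follows from each node independently choosing either to end its path or to continue it into exactly one of its children. A minimal Python sketch of the formula, using a hypothetical `children` dictionary that lists every tree node as a key, is:

```python
from math import prod


def count_partitionings(children):
    """children maps each dominator-tree node to the list of its children."""
    return prod(len(kids) + 1 for kids in children.values())


# Root r has children a and b; a has one child c.
example_tree = {"r": ["a", "b"], "a": ["c"], "b": [], "c": []}
n_partitionings = count_partitionings(example_tree)  # (2+1) * (1+1) * 1 * 1
```

Because the factors multiply, even modest branching in the dominator tree makes the count grow very quickly.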

To evaluate differences in dominator-path choices, 
we scheduled a group of small functions with DPS 
using every possible choice of dominator path. The 
target architecture for this study was a hypothetical 
6-wide long-instruction-word (LIW) machine, which 
was simulated and in which it was assumed that all 
cache accesses were hits. 

The results of exhaustive dominator-path testing 
show, as expected, that varying the choice of domina- 
tor paths significantly affects the performance of 
scheduling. For all functions of at least two basic 
blocks, DPS showed improvement over local schedul- 
ing for at least one of the possible choices of domina- 
tor paths. Table 1 shows the best, average, and worst 
percent speedup over local scheduling found for all 
functions that had a "best" speedup of over 2 percent; 
it also shows the speedup of the original implementation
of DPS and the number of distinct dominator tree
partitionings. The original implementation of DPS 
included a single, simple heuristic to choose domina- 
tor paths. More specifically, to choose dominator paths 
within a group, G, of contiguous blocks at the same 
nesting level, the compiler continues to choose a 
block, B, to "expand." Expansion of B initializes a new
dominator path to include B and adds B's dominators 
until no more can be added. The algorithm then starts 
another dominator path by expanding another (as yet 
unexpanded) block of G. The first block of G chosen 
to expand is the tail block, T, in an attempt to obtain as
long a dominator path as possible. 
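
One possible reading of this expansion heuristic is sketched below in Python. Several details are assumptions made for illustration rather than facts from the original implementation: the function and parameter names are hypothetical, "until no more can be added" is taken to mean that the next dominator lies outside G or already belongs to a path, and blocks after the tail are expanded in reverse layout order.

```python
def choose_paths(group, idom):
    """group: blocks of G in layout order; idom: block -> immediate dominator."""
    in_group = set(group)
    assigned = set()
    paths = []
    for block in reversed(group):            # the tail block is expanded first
        if block in assigned:
            continue
        path = [block]
        assigned.add(block)
        d = idom.get(block)
        # Keep adding dominators while they are in G and not yet on a path.
        while d in in_group and d not in assigned:
            path.append(d)
            assigned.add(d)
            d = idom.get(d)
        paths.append(list(reversed(path)))    # list the dominator first
    return paths


# Diamond: B1 branches to B2 and B3, which rejoin at B4.
paths = choose_paths(
    ["B1", "B2", "B3", "B4"],
    {"B2": "B1", "B3": "B1", "B4": "B1"},
)
```

In the diamond example the tail block B4 claims B1, leaving B2 and B3 as single-block paths.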

Unfortunately, not all functions are small enough to 
be tested by performing DPS for each possible parti- 
tioning of the dominator tree. Therefore, we defined 
37 different heuristic methods of choosing dominator
paths, based upon groupings of six key heuristic factors.

Individual heuristics were produced by adjusting the
maximum path length allowed under each basic factor.
We chose the heuristic factors listed below because
each seemed likely either to mimic the observed
characteristics of the best path selection or to allow
more freedom of code motion and, therefore, more
flexibility in filling "gaps."

■ One nesting level — Group blocks from the same 
nesting level of a loop. Each block is in the same 
strongly connected component, so the blocks tend 
to have similar restrictions to code motion. For a 
group of blocks to be a strongly connected compo- 
nent, there must be some path in the control flow 
graph from each node in the component to all the 
other nodes in the component. Since the function 
will probably repeat the loop, it seems likely that 
the scheduler will be able to overlap blocks in it.



Table 1
Percent of Function Speedup Improvement Using DPS Path Choices over Local Scheduling

                        Percent Speedup                No. Dominator
Function Name    Best   Average   Worst   Original    Tree Partitions
bubble           39.2     10.6     -0.1     11.7             72
readm            32.5      9.3     -0.2     32.5             48
solve            27.8      9.9     -0.2     27.8             96
queens           25.4      8.3     -0.4     -0.4             96
swaprow          23.1      5.8     -3.7     19.5             24
print(g)         22.0      9.1     -0.2     22.0              8
findmax          21.3      6.2     -0.3      8.7             18
copycol          18.5      5.6     -5.0     19.9              8
elim             14.3      2.3     -3.8     10.2            576
mult             13.7      2.1     -3.8     10.3             96
subst            12.9      2.4     -4.9      4.9             96
print(8)         12.5      6.2      0.0     12.5              8
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■ Longest path — Schedule the longest available path. 
This heuristic class allows the maximum distance 
for code motion. 

■ Postdominator — Follow the postdominator relation
in the dominator tree. When a dominator block, P, is
succeeded by a non-postdominator block, S, our
compiler adds P's def set to the use set of P's
BlockEnd node and P's use set to its def set to
prevent any code motion from P to S. If P is instead 
succeeded by its postdominator block, no such mod- 
ification is necessary, and code would be allowed to 
move in both directions. Intuitively, the postdomina- 
tor relation is the exact inverse of the dominator rela- 
tion, so code can move down, into a postdominator, 
as it moves up into a dominator. Further, the simple 
act of adding nodes to the DDD will complicate list 
scheduling, making it harder for the scheduler to 
generate the most efficient schedule. 

■ Non-postdominator — Follow a non-postdominator
in the dominator tree. This heuristic class generally 
means adding loop body blocks to the path. Notice 
that this seems at odds with the previous heuristic 
class. The previous class was suggested by intuition 
about the scheduler, and this one by observation of 
path behavior. 

■ idef size — Group by idef set size. The larger the
idef size, the more interference there is to code
motion. A small idef size will probably allow more
code motion, so we try to add blocks with small
idef sizes.

■ Density — Group by operation density. We define 
the density of each basic block as the number of 
nodes in the DDD divided by the number of instruc- 
tions required for local scheduling. A dense block 
already has close to its maximum number of opera- 
tions; adding or removing operations will probably 
not improve the schedule. For this reason, we want 
to avoid scheduling dense blocks together. Two 
methods are tried: scheduling dense blocks with 
sparse blocks and putting sparse blocks together. 

The heuristic factors were used to make individual 
heuristics by changing the limit on the possible num- 
ber of blocks in a path. It was reasonable to set limits 
for four factors: postdominator, non-postdominator, 
idef size, and density. We tried path length limits in
blocks of 2, 3, 4, 5, and unlimited, making a total of 
five heuristics from each heuristic factor. 

Running DPS using each of the heuristic methods 
and comparing the efficiency of the resulting code 
leads to several conclusions about effective heuristics 
for choosing DPS's dominator paths. For some heuris- 
tics, we can achieve the best schedules for DPS by 
using paths that rarely exceed three blocks. For any 
particular class of heuristics, we can achieve the best 
schedule with paths limited to five blocks or fewer. 



Consequently, path lengths can be limited without 
lowering the efficiency of generated code, and longer 
paths, which increase scheduling time, can be avoided. 

Since no one heuristic performed well for all func- 
tions, we advise using a combination of heuristics, i.e., 
schedule by using each of three heuristics and taking 
the best schedule. The "combined" heuristic includes
the following: 

■ Instruction density, limit to five blocks 

■ One nesting level on path, limit to five blocks 

■ Non-postdominator, unlimited length 
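
The "schedule three ways and keep the best" strategy can be sketched in a few lines of Python. Everything named here is hypothetical: `schedule_with` stands in for a run of DPS under one path-selection heuristic and is assumed to return the schedule length in cycles, and the cycle counts in the usage example are invented purely to exercise the function.

```python
def best_schedule(function, heuristics, schedule_with):
    """Run DPS once per heuristic and keep the shortest resulting schedule."""
    lengths = {h: schedule_with(function, h) for h in heuristics}
    winner = min(lengths, key=lengths.get)
    return winner, lengths[winner]


# Invented cycle counts, for illustration only.
fake_lengths = {
    "density, limit 5": 120,
    "one nesting level, limit 5": 112,
    "non-postdominator, unlimited": 131,
}
winner, cycles = best_schedule(
    "f", list(fake_lengths), lambda fn, h: fake_lengths[h]
)
```

The cost of this design is that each function is scheduled three times, which the path-length limits above help to keep affordable.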

Frequency-based List Scheduling 

Like some other global schedulers, DPS uses a local 
scheduling algorithm (list scheduling) on a global con- 
text, namely the meta-blocks built by DPS. This algo- 
rithm raises the possibility of moving code from less 
frequently executed blocks to more frequently executed 
blocks. At first glance, this practice seems to be a bad idea. 

In theory, to best schedule any meta-block, an 
instruction scheduler must account for the differing 
cost of the instructions within the meta-block. If a
single meta-block includes multiple nesting levels, the
scheduler must recognize that instructions added to 
blocks with higher nesting levels are more costly than 
those added to blocks with lower nesting levels. Even 
within a loop, there exists the potential for consider- 
able variation in the execution frequencies of different 
blocks in the meta-block due to control flow. Of 
course variable execution frequency is not an issue in 
traditional local scheduling because, within the con- 
text of a single basic block, each DDD node is
executed the same number of times, namely, once each
time execution enters the block. 

To address the issue of differing execution frequen- 
cies within meta-blocks scheduled as a single block by 
DPS, we investigated frequency-based list scheduling
(FBLS), 15 an extension of list scheduling that provides 
an answer to this difficulty by considering that execu- 
tion frequencies differ within sections of the meta- 
blocks. FBLS uses a greedy method to place DDD nodes 
in the lowest-cost instruction possible. FBLS amends 
the basic list-scheduling algorithm by revising only the 
DDD node placement policy in an attempt to reduce 
the run-time cycles required to execute a meta-block. 

Unfortunately, although FBLS makes intuitive sense, 
we found that DPS produced worse schedules with 
FBLS than it produced with a naive local scheduling 
algorithm that ignored frequency differences within 
DPS's meta-blocks. Therefore, the current imple- 
mentation of DPS ignores the execution frequency 
differences between basic blocks, both in choosing 
dominator paths to schedule and in scheduling those 
dominator-path meta-blocks. 
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Evaluation of Dominator-path Scheduling 

To measure the potential of DPS to generate more 
efficient schedules than local scheduling for commer- 
cial superscalar architectures, we ran a small test suite 
of C programs on an Alpha 21164 server. The Alpha 
server is a superscalar architecture capable of issuing 
two integer and two floating-point instructions each 
cycle. Our compiler estimates the effectiveness of a 
schedule by modeling the 21164 as an LIW
architecture with all operation latencies known at compile
time. Of course this model was used only within the 
compiler itself. Our results measured changes in 
21164 execution time (measured with the UNIX 
"time" command) required for each program. 

Our test suite of 14 C programs includes 8 programs 
that use integer computation only and 6 programs that 
include floating-point computation. We separated 
those groups because we see dramatic differences in 
DPS's performance when viewing integer and floating- 
point programs. To choose dominator paths, we used 
the combined heuristic recommended by Huber." 1 

Table 2 summarizes the results of tests we con- 
ducted to compare the execution times of programs 
using DPS scheduling with those using local schedul- 
ing only. The table lists the programs used in the test 
suite and the percent improvement in execution times 
for DPS-scheduled programs. The execution time 



Table 2
Percent DPS Scheduling Improvements over Local
Scheduling of Programs

                           Percent Execution
Program                    Time Improvement
8-Queens                          7.3
SymbolTable                       7.3
BubbleSort                        5.0
Nsieve                            6.1
Heapsort                          6.0
Killcache                         2.6
TSP                               2.4
Dhrystone                         0.7
C integer average                 4.7
Dice                              3.7
Whetstone                         5.4
Matrix Multiply                  16.2
Gauss                            12.3
Finite Difference                17.6
Livermore                         9.3
C floating-point average         10.8
Overall average                   7.3


measurements were made on an Alpha 21164 server
running at 250 megahertz with data cache sizes of 8 
kilobytes, 96 kilobytes, and 4 megabytes. 

Looking at Table 2, we see that, in general, DPS 
improved the integer programs less than it improved 
the floating-point programs. The range of improve- 
ments for integer programs was from 0.7 percent for 
Dhrystone to 7.3 percent each for 8-Queens and for 
SymbolTable. Summing all the improvements and 
dividing by eight (the number of integer programs) 
gives an "average" of 4.7 percent improvement for the 
integer programs. DPS improved some of the floating- 
point programs even more significantly than the inte- 
ger programs. The range of improvements for the six 
floating-point programs was from 3.7 percent for Dice 
(a simulation of rolling a pair of dice 10,000,000 times 
using a uniform random number generator) to 17.6 
percent improvement for the finite difference pro- 
gram. The average for the six floating-point programs 
was 10.8 percent. This suggests, not surprisingly, that 
the Alpha 21164 provides more opportunities for 
global scheduling improvement when floating-point 
programs are being compiled.
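
The averages quoted from Table 2 can be rechecked with a short calculation. The Python sketch below uses the table's per-program values and decimal arithmetic so that rounding to one decimal place is exact; the helper name `mean` and the round-half-up convention are choices made here, not something the source specifies.

```python
from decimal import Decimal, ROUND_HALF_UP

# Percent improvements from Table 2, kept as strings for exact decimals.
integer = ["7.3", "7.3", "5.0", "6.1", "6.0", "2.6", "2.4", "0.7"]
floating = ["3.7", "5.4", "16.2", "12.3", "17.6", "9.3"]


def mean(values):
    """Arithmetic mean rounded to one decimal place (round half up)."""
    total = sum(Decimal(v) for v in values)
    return (total / len(values)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


int_avg = mean(integer)             # 37.4 / 8
fp_avg = mean(floating)             # 64.5 / 6
overall = mean(integer + floating)  # 101.9 / 14
```

All three results agree with the averages reported in the table.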

Even within the six floating-point programs, how- 
ever, we see a distinct bi-modal behavior in terms of 
execution-time improvement. Three of the programs 
range from 12.3 percent to 17.6 percent improve- 
ment, whereas three are below 10 percent (and two of 
those significantly below 10 percent). A reason for this 
wide range is the use of global variables. Remember 
that DPS forbids the motion of global variable defini- 
tions across block boundaries. This is necessary to 
ensure correct program semantics. It is hardly a coinci- 
dence that both Dice and Whetstone include only 
global floating-point variables, whereas Livermore's 
floating-point variables are mixed about half local 
and half global, and the three better performers use 
almost no global variables. Thus we conclude that, for 
floating-point programs with few global variables, we 
can expect improvements of roughly 12 to 15 percent 
in execution time. Inclusion of global variables and 
exclusion of floating-point values will, however, 
decrease DPS's ability to improve execution time for 
the Alpha 21164. 

Related Work 

As we have discussed, local instruction scheduling can 
find parallelism within a basic block but cannot exploit 
parallelism between basic blocks. Several global sched- 
uling techniques are available, however, that extract 
parallelism from a program by moving operations 
across block boundaries and subsequently inserting 
compensation copies to maintain program semantics. 
Trace scheduling' was the first of these techniques to 
be defined. As previously mentioned, trace scheduling 
requires compensation copies. Other "early" global
scheduling algorithms that require compensation
copies include Nicolau's percolation scheduling^'" 
and Gupta's region scheduling." 1 A recent and quite 
popular extension of trace scheduling is Hwu's 
SuperBlock scheduling. I9 ' 2 * In addition to these more 
general, global scheduling methods, significant results 
have been obtained by software pipelining, which is a 
technique that overlaps iterations of loops to exploit 
available ILP. Allan et al. n provide a good summary, 
and Rau 22 provides an excellent tutorial on how modulo 
scheduling, a popular software pipelining technique, 
should be implemented. Promising recent techniques 
have focused on defining a meta-environment, which 
includes both global scheduling and software pipelin- 
ing. Moon and Ebcioglu 23 present an aggressive tech- 
nique that combines software pipelining and global 
code motion (with copies) into a single framework. 
Novak and Nicolau" describe a sophisticated schedul- 
ing framework in which to place software pipelining, 
including alternatives to modulo scheduling. While 
providing a significant number of excellent global 
scheduling alternatives, none of these techniques pro- 
vides global scheduling without the possibility of code 
expansion (copy code) as DPS does. 

To address the issue of producing schedules without 
operation copies, Bernstein 25 27 defined a technique he 
calls global instruction scheduling (GIS) that allows
movement of instructions beyond block boundaries 
based upon the program dependence graph (PDG). 2S In 
a test suite of four programs run on IBM's RS/6000, 
Bernstein's method showed improvement of roughly 
7 percent over local scheduling for two of the programs, 
with no significant difference for the others.

Comparing DPS to Bernstein's method, we see that 
both allow for interblock motion without copies. 
Bernstein also allows for interblock movement requir- 
ing duplicates that DPS does not. Interestingly, 
Bernstein's later work 27 does not make use of this abil- 
ity to allow motion that requires duplication of opera- 
tions, suggesting that, to date, he has not found such 
motion advisable for the RS/6000 architecture to 
which his techniques have been applied. Bernstein 
allows operation movement in only one direction, 
whereas DPS allows operations to move from a domi- 
nator block to a postdominator. This added flexibility is 
an advantage to DPS. Of possibly greater significance, 
DPS uses the local instruction scheduler to place opera- 
tions. Bernstein uses a separate set of heuristics to move 
operations in the PDG and then uses a subsequent local 
scheduling pass to order operations within each block. 
Fisher' argues that incorporating movement of opera- 
tions with the scheduling phase itself provides better 
scheduling than dividing the interblock motion and 
scheduling phases. Based on that criterion alone, DPS 
has some advantages over Bernstein's method.



Conclusions 

It is commonly accepted that to exploit the perfor- 
mance benefits of ILP, global instruction scheduling is 
required. Several varieties of global instruction sched- 
uling exist, most requiring compensation copies to 
ensure proper program semantics when operations 
cross block boundaries during instruction scheduling. 
Although such global scheduling with compensation 
copies may be an effective strategy for architectures 
with large degrees of ILP, another approach seems 
reasonable for more limited architectures, such as cur- 
rently available superscalar computers. 

This paper outlines DPS, a global instruction sched- 
uling technique that does not require compensation 
copies. Based on the fact that more than 25 percent of 
intermediate statements can be moved upward at least 
one dominator block in the control flow graph with- 
out changing program semantics, DPS schedules paths 
in a function's dominator tree as meta-blocks, making 
use of an extended local instruction scheduler to 
schedule dominator paths. 

Experimental evidence shows that DPS does indeed 
produce more efficient schedules than local scheduling
for Compaq's Alpha 21164 server system,
particularly for floating-point programs that avoid the use of
global variables. This work has demonstrated that con- 
siderable flexibility in placement of code is possible 
even when compensation copies are not allowed. 
Although more research is required to look into 
possible uses for this flexibility, the global instruction 
scheduling method described here (DPS) shows 
promise for ILP architectures. 
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Parallelizing compilers for multiprocessors face 
many hurdles. However, SUIF's robust analysis 
and memory optimization techniques enabled 
speedups on three fourths of the NAS and 
SPECfp95 benchmark programs. 
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The affordability of shared memory multiprocessors 
offers the potential of supercomputer-class performance 
to the general public. Typically used in a multiprogram- 
ming mode, these machines increase throughput by 
running several independent applications in parallel. 
But multiple processors can also work together to 
speed up single applications. This requires that ordinary 
sequential programs be rewritten to take advantage of 
the extra processors. 1 4 Automatic parallelization with a 
compiler offers a way to do this. 

Parallelizing compilers face more difficult challenges 
from multiprocessors than from vector machines, which 
were their initial target. Using a vector architecture effec- 
tively involves parallelizing repeated arithmetic opera- 
tions on large data streams — for example, the innermost 
loops in array-oriented programs. On a multiprocessor, 
however, this approach typically does not provide suffi- 
cient granularity of parallelism: Not enough work is 
performed in parallel to overcome processor synch- 
ronization and communication overhead. To use a 
multiprocessor effectively, the compiler must exploit 
coarse-grain parallelism, locating large computations 
that can execute independently in parallel. 

Locating parallelism is just the first step in produc- 
ing efficient multiprocessor code. Achieving high per- 
formance also requires effective use of the memory 
hierarchy, and multiprocessor systems have more com- 
plex memory hierarchies than typical vector machines: 
They contain not only shared memory but also multi- 
ple levels of cache memory. 

These added challenges often limited the effectiveness 
of early parallelizing compilers for multiprocessors, so 
programmers developed their applications from scratch, 
without assistance from tools. But explicitly managing an
application's parallelism and memory use requires a great 
deal of programming knowledge, and the work is tedious 
and error-prone. Moreover, the resulting programs are 
optimized for only a specific machine. Thus, the effort
required to develop efficient parallel programs restricts 
the user base for multiprocessors. 

This article describes automatic parallelization tech- 
niques in the SUIF (Stanford University Intermediate 
Format) compiler that result in good multiprocessor
performance for array-based numerical programs. We
provide SUIF performance measurements for the com-
plete NAS and SPECfp95 benchmark suites. Overall, the
results for these scientific programs are promising. The 
compiler yields speedups on three fourths of the pro- 
grams and has obtained the highest ever performance on 
the SPECfp95 benchmark, indicating that the compiler 
can also achieve efficient absolute performance. 

Finding Coarse-grain Parallelism 

Multiprocessors work best when the individual proces- 
sors have large units of independent computation, but 
it is not easy to find such coarse-grain parallelism. First 
the compiler must find available parallelism across pro- 
cedure boundaries. Furthermore, the original compu- 
tations may not be parallelizable as given and may first 
require some transformations. For example, experience 
in parallelizing by hand suggests that we must often 
replace global arrays with private versions on different 
processors. In other cases, the computation may 
need to be restructured — for example, we may have to 
replace a sequential accumulation with a parallel reduc- 
tion operation. 
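
The reduction transformation can be illustrated with a minimal sketch. The Python fragment below is not SUIF output; the function names and the block partitioning of the iteration space are assumptions chosen for illustration, and the processors are simulated by an ordinary loop. Each simulated processor accumulates into its own private partial sum, and the partial sums are combined once at the end.

```python
def sequential_sum(values):
    """Original form: every iteration updates the same accumulator."""
    total = 0
    for v in values:
        total += v
    return total


def reduction_sum(values, num_procs):
    """Reduction form: one private accumulator per (simulated) processor."""
    n = len(values)
    chunk = (n + num_procs - 1) // num_procs  # block size, rounded up
    partial = [0] * num_procs                 # private partial sums
    for p in range(num_procs):                # each p could run in parallel
        for i in range(p * chunk, min((p + 1) * chunk, n)):
            partial[p] += values[i]
    total = 0
    for s in partial:                         # final combining step
        total += s
    return total
```

With integer data the two versions agree exactly. With floating-point data the additions are performed in a different order, so the results may differ in rounding.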

It takes a large suite of robust analysis techniques to 
successfully locate coarse-grain parallelism. General 
and uniform frameworks helped us manage the com- 
plexity involved in building such a system into SUIF. 
We automated the analysis to privatize arrays and to 
recognize reductions to both scalar and array variables. 
Our compiler's analysis techniques all operate seam- 
lessly across procedure boundaries. 

Scalar Analyses 

An initial phase analyzes scalar variables in the programs. 
It uses techniques such as data dependence analysis, 
scalar privatization analysis, and reduction recognition 
to detect parallelism among operations with scalar vari- 
ables. It also derives symbolic information on these scalar 
variables that is useful in the array analysis phase. Such 
information includes constant propagation, induction 
variable recognition and elimination, recognition of 
loop-invariant computations, and symbolic relation 
propagation.

Array Analyses 

An array analysis phase uses a unified mathematical 
framework based on linear algebra and integer linear 
programming. The analysis applies the basic data
dependence test to determine if accesses to an array 
can refer to the same location. To support array priva- 
tization, it also finds array dataflow information that 
determines whether array elements used in an iteration 
refer to the values produced in a previous iteration. 



Moreover, it recognizes commutative operations on 
sections of an array and transforms them into parallel 
reductions. The reduction analysis is powerful enough 
to recognize commutative updates of even indirectly 
accessed array locations, allowing parallelization of 
sparse computations. 
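
A commutative update through an index array can be sketched as follows. This is an illustrative Python fragment under stated assumptions, not the code SUIF generates: the names `index`, `weight`, and `num_bins`, and the cyclic distribution of iterations, are hypothetical. Because the additions commute, each simulated processor can update a private copy of the array, and the copies are summed afterward.

```python
def update_sequential(index, weight, num_bins):
    """Original loop: bins[index[k]] is updated through an index array."""
    bins = [0] * num_bins
    for k in range(len(index)):
        bins[index[k]] += weight[k]
    return bins


def update_as_reduction(index, weight, num_bins, num_procs):
    """Each simulated processor updates a private copy; copies are then summed."""
    private = [[0] * num_bins for _ in range(num_procs)]
    for p in range(num_procs):                      # could run in parallel
        for k in range(p, len(index), num_procs):   # cyclic distribution
            private[p][index[k]] += weight[k]
    return [sum(private[p][b] for p in range(num_procs))
            for b in range(num_bins)]
```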

All these analyses are formulated in terms of integer 
programming problems on systems of linear inequali- 
ties that represent the data accessed. These inequalities 
are derived from loop bounds and array access func- 
tions. Implementing optimizations to speed up com- 
mon cases reduces the compilation time. 
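
To make the formulation concrete, the sketch below poses a loop-carried dependence question as a search for integer solutions to constraints derived from the loop bounds and the array subscripts. It is a minimal illustration only: it enumerates the iteration space exhaustively rather than using the integer programming techniques the paper refers to, it checks only a single write-read pair of affine subscripts, and the function name and argument layout are assumptions.

```python
def loop_carried_dependence(lo, hi, write, read):
    """Look for iterations i1 != i2 in [lo, hi] that touch the same element.

    write and read are (coef, offset) pairs describing the affine
    subscripts coef * i + offset of the written and the read reference.
    Returns a witness pair (i1, i2), or None if no integer solution exists.
    """
    wc, wo = write
    rc, ro = read
    for i1 in range(lo, hi + 1):          # lo <= i1 <= hi
        for i2 in range(lo, hi + 1):      # lo <= i2 <= hi
            if i1 != i2 and wc * i1 + wo == rc * i2 + ro:
                return (i1, i2)
    return None
```

For example, a loop that writes `a[2*i]` and reads `a[2*i + 1]` has no solution (the subscripts differ in parity), whereas a loop that writes `a[i + 1]` and reads `a[i]` does.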

Interprocedural Analysis Framework 

All the analyses are implemented using a uniform 
interprocedural analysis framework, which helps man-
age the software engineering complexity. The frame-
work uses interprocedural dataflow analysis, which is
more efficient than the more common technique of 
inline substitution. 1 Inline substitution replaces each 
procedure call with a copy of the called procedure, 
then analyzes the expanded code in the usual intrapro- 
cedural manner. Inline substitution is not practical for 
large programs, because it can make the program too 
large to analyze. 

Our technique analyzes only a single copy of each 
procedure, capturing its side effects in a function. This 
function is then applied at each call site to produce
precise results. When different calling contexts make it
necessary, the algorithm selectively clones a procedure 
so that code can be analyzed and possibly parallelized 
under different calling contexts (as when different 
constant values are passed to the same formal parame-
ter). In this way the full advantages of inlining are
achieved without expanding the code indiscriminately. 
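
The idea of summarizing a procedure once and applying the summary at each call site can be sketched as below. This is a deliberately small illustration, not SUIF's framework: the procedure `fill(a, lo, hi)`, its summary, and the interval representation of array sections are all hypothetical, and selective cloning is not shown.

```python
def summarize_fill(lo, hi):
    """Side-effect summary of a hypothetical procedure fill(a, lo, hi)
    that writes a[lo], ..., a[hi - 1]: the written index interval."""
    return (lo, hi)


def intervals_overlap(x, y):
    """True if the half-open intervals x and y share at least one index."""
    return max(x[0], y[0]) < min(x[1], y[1])


def call_loop_is_parallel(call_args):
    """call_args[k] holds the (lo, hi) actuals passed by loop iteration k.

    The summary is applied at every call; the loop is reported parallel
    only if no two iterations write overlapping sections of the array.
    """
    sections = [summarize_fill(lo, hi) for lo, hi in call_args]
    for a in range(len(sections)):
        for b in range(a + 1, len(sections)):
            if intervals_overlap(sections[a], sections[b]):
                return False
    return True
```

The body of `fill` is never copied into the caller; only its summary is evaluated with the actual arguments of each call.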

In Figure 1 the boxes represent procedure bodies, 
and the lines connecting them represent procedure 
calls. The main computation is a series of four loops to 
compute three-dimensional fast Fourier transforms. 
Using interprocedural scalar and array analyses, the 
SUIF compiler determines that these loops are paral- 
lelizable. Each loop contains more than 500 lines of 
code spanning up to nine procedures with up to 42 
procedure calls. If this program had been fully inlined,
the loops presented to the compiler for analysis would 
have each contained more than 86,000 lines of code. 

Memory Optimization 

Numerical applications on high-performance micro- 
processors are often memory bound. Even with one or 
more levels of cache to bridge the gap between proces- 
sor and memory speeds, a processor may still waste half 
its time stalled on memory accesses because it frequently
references an item not in the cache (a cache miss). This 

Figure 1

The compiler discovers parallelism through interprocedural array analysis. Each of the four parallelized loops at left consists of 
more than 500 lines of code spanning up to nine procedures (boxes) with up to 42 procedure calls (lines). 



memory bottleneck is further exacerbated on multi- 
processors by their greater need for memory traffic, 
resulting in more contention on the memory bus. 

An effective compiler must address four issues that 
affect cache behavior: 

■ Communication: Processors in a multiprocessor 
system communicate through accesses to the same 
memory location. Coherent caches typically keep 
the data consistent by causing accesses to data writ- 
ten by another processor to miss in the cache. Such 
misses are called true sharing misses. 

■ Limited capacity: Numeric applications tend to have 
large working sets, which typically exceed cache 
capacity. These applications often stream through 
large amounts of data before reusing any of it, 
resulting in poor temporal locality and numerous 
capacity misses. 

■ Limited associativity: Caches typically have a small 
set associativity; that is, each memory location can 
map to only one or just a few locations in the cache. 
Conflict misses — when an item is discarded and 
later retrieved — can occur even when the applica- 
tion's working set is smaller than the cache, if the 
data are mapped to the same cache locations.

■ Large line size: Data in a cache are transferred in 
fixed-size units called cache lines. Applications that 
do not use all the data in a cache line incur more 
misses and are said to have poor spatial locality. On 
a multiprocessor, large cache lines can also lead to 
cache misses when different processors use different
parts of the same cache line. Such misses are
called false sharing misses. 
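
The effect of limited associativity and of line size can be seen in a minimal simulation of a direct-mapped cache. The sketch below is an assumption-laden illustration (a single processor, byte addresses, hypothetical parameter names) and does not model the AlphaServer's caches.

```python
def count_misses(addresses, num_lines, line_size):
    """Simulate a direct-mapped cache and return the number of misses."""
    tags = [None] * num_lines          # block currently held by each slot
    misses = 0
    for addr in addresses:
        block = addr // line_size      # which cache line the byte belongs to
        slot = block % num_lines       # the single slot that block can occupy
        if tags[slot] != block:
            misses += 1
            tags[slot] = block         # evict whatever was there
    return misses
```

With four 8-byte lines, alternating between addresses 0 and 32 misses on every access because both blocks map to slot 0, even though the working set is only two lines; alternating between addresses 0 and 8 misses only twice. Reading bytes 0 through 7 costs one miss, while reading one byte from each of four different lines costs four.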

The compiler tries to eliminate as many cache misses as 
possible, then minimize the impact of any that remain by 

■ ensuring that processors reuse the same data as 
many times as possible and 

■ making the data accessed by each processor con- 
tiguous in the shared address space. 

Techniques for addressing each of these subproblems 
are discussed below. Finally, to tolerate the latency of
remaining cache misses, the compiler uses compiler- 
inserted prefetching to move data into the cache before 
it is needed. 

Improving Processor Data Reuse 

The compiler reorganizes the computation so that each 
processor reuses data to the greatest possible extent.
This reduces the working set on each processor, 
thereby minimizing capacity misses. It also reduces 
interprocessor communication and thus minimizes 
true sharing misses. To achieve optimal reuse, the com-
piler uses affine partitioning. This technique analyzes
reference patterns in the program to derive an affine 
mapping (linear transformation plus an offset) of the 
computation of the data to the processors. The affine 
mappings are chosen to maximize a processor's reuse 
of data while maintaining sufficient parallelism to keep 
all processors busy. The compiler also uses loop block- 
ing to reorder the computation executed on a single 
processor so that data is reused in the cache. 
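
Loop blocking can be sketched on a matrix multiplication. The fragment below is a generic textbook form of the transformation, written in Python for illustration; it is not taken from SUIF, and the block size parameter is an assumption. The blocked version performs exactly the same multiply-add operations, but visits them tile by tile so that a small part of `b` is reused for every row `i` before the computation moves on.

```python
def matmul(a, b, n):
    """Straightforward n x n matrix multiplication."""
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, n, block):
    """Same computation, reordered so each block x block tile of b is reused."""
    c = [[0] * n for _ in range(n)]
    for jj in range(0, n, block):            # tile of columns of b
        for kk in range(0, n, block):        # tile of rows of b
            for i in range(n):
                for j in range(jj, min(jj + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        c[i][j] += a[i][k] * b[k][j]
    return c
```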

Making Processor Data Contiguous

The compiler tries to arrange the data to make a 
processor's accesses contiguous in the shared address 
space. This improves spatial locality while reducing 
conflict misses and false sharing. SUIF can manage 
data placement within a single array and across multi- 
ple arrays. The data-to-processor mappings computed 
by the affine partitioning analysis are used to deter- 
mine the data being accessed by each processor. 

Figure 2 shows how the compiler's use of data per-
mutation and data strip-mining can make contiguous
the data within a single array that is accessed by one 
processor. Data permutation interchanges the dimen- 
sions of the array — for example, transposing a two- 
dimensional array. Data strip-mining changes an 
array's dimensionality so that all data accessed by the 
same processor are in the same plane of the array. 
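
The two data transformations can be expressed as a change in the address computation. The sketch below assumes a two-dimensional array of `n` by `m` elements whose first axis is contiguous, and assumes that each processor owns a block of `block` consecutive indices along that first axis (with `n` divisible by `block`); these layout choices and all function names are illustrative assumptions rather than details taken from the paper.

```python
def original_address(i, j, n):
    """Element (i, j) of an n x m array whose first axis is contiguous."""
    return i + n * j


def owner(i, block):
    """Processor that accesses row index i under a block distribution."""
    return i // block


def transformed_address(i, j, n, m, block):
    """Address after strip-mining the first axis and permuting the result.

    Strip-mining splits i into (i_in, p), giving a 3D array indexed by
    (i_in, p, j).  Permutation then moves the processor index p to the
    slowest-varying position, giving (i_in, j, p).  The argument n is kept
    only for symmetry with original_address.
    """
    p, i_in = divmod(i, block)
    return i_in + block * (j + m * p)
```

Under the original layout, the elements owned by one processor are scattered in runs of `block` elements separated by a stride of `n`. After the transformation, each processor's elements occupy one contiguous range of `block * m` addresses.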

To make data across multiple arrays accessed by the 
same processor contiguous, we use a technique called 
compiler-directed page coloring. The compiler uses



its knowledge of the access patterns to direct the oper- 
ating system's page allocation policy to make each 
processor's data contiguous in the physical address 
space. The operating system uses these hints to deter- 
mine the virtual-to-physical page mapping at page 
allocation time. 

Experimental Results 

We conducted a series of performance evaluations to 
demonstrate the impact of SUIF's analyses and opti- 
mizations. We obtained measurements on a Digital 
AlphaServer 8400 with eight 21164 processors, each
with two levels of on-chip cache and a 4-Mbyte exter- 
nal cache. Because speedups are harder to obtain on 
machines with fast processors, our use of a state-of- 
the-art machine makes the results more meaningful 
and applicable to future systems. 

We used two complete standard benchmark suites 
to evaluate our compiler. We present results for the 10 

STRIP-MINING PERMUTATION



Figure 2 

Data transformations can make the data accessed by each processor contiguous in the shared address space. In the two 
examples above, the original arrays are two-dimensional; the axes are identified to show that elements along the first axis 
are contiguous. First the affine partitioning analysis determines which data elements are accessed by the same processor 
(the shaded elements are accessed by the first processor). Second, data strip-mining turns the 2D array into a 3D array,
with the shaded elements in the same plane. Finally, applying data permutation rotates the array, making data accessed 
by each processor contiguous. 

programs in the SPECfp95 benchmark suite, which is
commonly used for benchmarking uniprocessors. We 
also used the eight official benchmark programs from 
the NAS parallel-system benchmark suite, except for 
embar; here we used a slightly modified version from 
Applied Parallel Research. 

Figure 3 shows the SPECfp95 and NAS speedups, 
measured on up to eight processors on a 300-MHz 
AlphaServer. We calculated the speedups over the best
sequential execution time from either officially reported 
results or our own measurements. Note that mgrid and 
applu appear in both benchmark suites (the program 
source and data set sizes differ slightly). 

To measure the effects of the different compiler 
techniques, we broke down the performance obtained 
on eight processors into three components. In Figure 
4, baseline shows the speedup obtained with paral- 
lelization using only intraprocedural data dependence 
analysis, scalar privatization, and scalar reduction 
transformations. Coarse grain includes the baseline 



techniques as well as techniques for locating coarse- 
grain parallel loops — for example, array privatization 
and reduction transformations, and full interproce- 
dural analysis of both scalar and array variables. 
Memory includes the coarse-grain techniques as well 
as the multiprocessor memory optimizations we 
described earlier. 

Figure 3 shows that of the 18 programs, 13 show good
parallel speedup and can thus take advantage of additional 
processors. SUIF's coarse-grain techniques and memory 
optimizations significantly affect the performance of half 
the programs. The swim and tomcatv programs show 
superlinear speedups because the compiler eliminates 
almost all cache misses and their 14 Mbyte working sets 
fit into the multiprocessor's aggregate cache. 
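
The arithmetic behind the superlinear speedups can be checked with a short sketch. It uses only figures stated in this paper: the 4-Mbyte external cache per processor, the 14-Mbyte working sets, and the one-processor and eight-processor execution times listed for tomcatv and swim in Table 1. Two caveats apply: the on-chip caches are ignored, and the Table 1 times were measured on a 440-MHz machine, so the ratios are only indicative of the speedups plotted in Figure 3. The variable names are illustrative.

```python
CACHE_MBYTES_PER_PROCESSOR = 4    # external cache on each processor
NUM_PROCESSORS = 8
WORKING_SET_MBYTES = 14           # swim and tomcatv

# The working set does not fit in one external cache but does fit in eight.
fits_one_cache = WORKING_SET_MBYTES <= CACHE_MBYTES_PER_PROCESSOR
fits_aggregate = (WORKING_SET_MBYTES
                  <= CACHE_MBYTES_PER_PROCESSOR * NUM_PROCESSORS)

# (1P seconds, 8P seconds) as listed in Table 1.
times = {"tomcatv": (219.1, 18.5), "swim": (297.9, 17.2)}
speedups = {name: t1 / t8 for name, (t1, t8) in times.items()}
```

Both ratios exceed the processor count of eight, which is what superlinear speedup means here.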

For most of the programs that did not speed up, the
compiler found much of their computation to be par-
allelizable, but the granularity is too fine to yield good
multiprocessor performance on machines with fast 
processors. Only two applications, fpppp and buk, have 

Figure 3

SUIF compiler speedups over the best sequential time achieved on the (a) SPECfp95 and (b) NAS parallel benchmarks. 

Figure 4

The speedup achieved on eight processors is broken down into three components to show how SUIF's memory optimization 
and discovery of coarse-grain parallelism affected performance. 



no statically analyzable loop-level parallelism, so they 
are not amenable to our techniques. 

Table 1 shows the times and SPEC ratios obtained 
on an eight-processor, 440-MHz Digital AlphaServer 
8400, testifying to our compiler's high absolute per- 
formance. The SPEC ratios compare machine perfor- 
mance with that of a reference machine. (These are 
not official SPEC ratings, which among other things 



require that the software be generally available. The 
ratios we obtained are nevertheless valid in assessing 
our compiler's performance.) The geometric mean of 
the SPEC ratios improves over the uniprocessor execu- 
tion by a factor of 3 with four processors and by a fac- 
tor of 4.3 with eight processors. Our eight-processor 
ratio of 63.9 represents a 50 percent improvement 
over the highest number reported to date. 12 
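
The geometric means and improvement factors quoted above can be recomputed from the per-program SPEC ratios in Table 1. The following Python sketch does so; the dictionary layout and names are illustrative, and the values are copied from the table.

```python
import math

# SPEC ratios from Table 1 for one, four, and eight processors.
SPEC_RATIOS = {
    "tomcatv": (16.9, 122.1, 200.0),
    "swim":    (28.9, 256.7, 500.0),
    "su2cor":  (9.0, 31.2, 45.2),
    "hydro2d": (9.6, 39.3, 59.0),
    "mgrid":   (13.5, 59.5, 92.6),
    "applu":   (7.4, 25.7, 55.7),
    "turb3d":  (15.3, 55.7, 94.3),
    "apsi":    (15.3, 14.9, 14.7),
    "fpppp":   (29.0, 29.0, 29.0),
    "wave5":   (19.8, 21.1, 20.4),
}


def geometric_mean(values):
    """n-th root of the product, computed through logarithms."""
    return math.exp(sum(math.log(v) for v in values) / len(values))


# One geometric mean per processor count (1P, 4P, 8P).
means = [geometric_mean([r[k] for r in SPEC_RATIOS.values()])
         for k in range(3)]
improvement_4p = means[1] / means[0]
improvement_8p = means[2] / means[0]
```

Rounded to one decimal place, the means reproduce the Geometric Mean row of Table 1, and the two quotients reproduce the factors of 3 and 4.3.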



Table 1 

Absolute Performance for the SPECfp95 Benchmarks Measured on a 440-MHz Digital AlphaServer Using One
Processor, Four Processors, and Eight Processors 



                   Execution Time (secs)            SPEC Ratio
Benchmark              1P       4P       8P       1P       4P       8P

tomcatv             219.1     30.3     18.5     16.9    122.1    200.0
swim                297.9     33.5     17.2     28.9    256.7    500.0
su2cor              155.0     44.9     31.0      9.0     31.2     45.2
hydro2d             249.4     61.1     40.7      9.6     39.3     59.0
mgrid               185.3     42.0     27.0     13.5     59.5     92.6
applu               296.1     85.5     39.5      7.4     25.7     55.7
turb3d              267.7     73.6     43.5     15.3     55.7     94.3
apsi                137.5    141.2    143.2     15.3     14.9     14.7
fpppp               331.6    331.6    331.6     29.0     29.0     29.0
wave5               151.8    141.9    147.4     19.8     21.1     20.4

Geometric Mean                                  15.0     44.4     63.9

Editors' Note: With the following section, the authors
provide an update on the status of the SUIF compiler 
since the publication of their paper in Computer in 
December 1996. 

Addendum: The Status and Future of SUIF 

Public Availability of SUIF-parallelized Benchmarks 

The SUIF-parallelized versions of the SPECfp95
benchmarks used for the experiments described in this 
paper have been released to the SPEC committee and 
are available to any license holders of SPEC (see 
http://www.specbench.org/osg/cpu95/par-research).
This benchmark distribution contains the SUIF out- 
put (C and FORTRAN code), along with the source 
code for the accompanying run-time libraries. We expect
these benchmarks will be useful for two purposes: 
(1) for technology transfer, providing insight into how 
the compiler transforms the applications to yield the 
reported results; and (2) for further experimentation, 
such as in architecture-simulation studies. 

The SUIF compiler system itself is available from the 
SUIF web site at http://www-suif.stanford.edu. This
system includes only the standard parallelization analy- 
ses that were used to obtain our baseline results. 

New Parallelization Analyses in SUIF

Overall, the results of automatic parallelization reported 
in this paper are impressive; however, a few applica- 
tions either do not speed up at all or achieve limited 
speedup at best. The question arises as to whether 
SUIF is exploiting all the available parallelism in these 
applications. Recently, an experiment to answer this 
question was performed in which loops left unparal- 
lelized by SUIF were instrumented with run-time tests 
to determine whether opportunities for increasing the 
effectiveness of automatic parallelization remained in 
these programs. 1 Run-time testing determined that
eight of the programs from the NAS and SPECfp95
benchmarks had additional parallel loops, for a total of 
69 additional parallelizable loops, which is less than 5% 
of the total number of loops in these programs. Of 
these 69 loops, the remaining parallelism had a signifi- 
cant effect on coverage (the percentage of the pro- 
gram that is parallelizable) or granularity (the size of 
the parallel regions) in only four of the programs: apsi, 
su2cor, wave5, and fftpde. 

We found that almost all the significant loops in 
these four programs could potentially be parallelized 
using a new approach that associates predicates with 
array data-flow values. 2 Instead of producing conserv-
ative results that hold for all control-flow paths and all
possible program inputs, predicated array data-flow 
analysis can derive optimistic results guarded by predi- 
cates. Predicated array data-flow analysis can lead to 
more effective automatic parallelization in three ways: 
(1) It improves compile-time analysis by ruling out 
infeasible control-flow paths. (2) It provides a frame- 
work for the compiler to introduce predicates that, if 
proven true, would guarantee safety for desirable data-
flow values. (3) It enables the compiler to derive low-cost
run-time parallelization tests based on the predicates 
associated with desirable data-flow values. 
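
The third point, a low-cost run-time parallelization test, can be illustrated with a small sketch. The loop, the predicate, and the function names below are hypothetical examples and are not drawn from the benchmarks or from SUIF. The loop copies `a[i] + 1` into `a[i + k]`; when `k >= n` or `k == 0`, no iteration reads an element written by a different iteration, so the iterations may run in any order. Parallel execution is simulated here by running the iterations in reverse.

```python
def shift_add_sequential(a, n, k):
    """Original loop; assumes k >= 0 and len(a) >= n + k."""
    for i in range(n):
        a[i + k] = a[i] + 1
    return a


def shift_add_guarded(a, n, k):
    """Two-version loop selected by a run-time test on the predicate.

    Returns the updated array and a flag saying whether the
    order-independent (parallelizable) version was chosen.
    """
    if k >= n or k == 0:
        # Predicate holds: reads and writes of different iterations never
        # touch the same element, so any iteration order is legal.
        for i in reversed(range(n)):
            a[i + k] = a[i] + 1
        return a, True
    # Predicate fails: fall back to the original sequential loop.
    return shift_add_sequential(a, n, k), False
```

The test costs two comparisons per loop invocation, independent of the trip count.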

SUIF and Compaq's GEM Compiler

The GEM compiler system is the technology Compaq 
has been using to build compiler products for a variety 
of languages and hardware/software platforms. 3 
Within Compaq, work has been done to connect SUIF 
with the GEM compiler. SUIF's intermediate repre- 
sentation was converted into GEM's intermediate rep- 
resentation, so that SUIF code can be passed directly 
to GEM's optimizing back end. This eliminates the 
loss of information suffered when SUIF code is trans- 
lated to C/FORTRAN source before it is passed to 
GEM. It also enables us to generate more efficient 
code for Alpha-microprocessor systems. 

SUIF and the National Compiler Infrastructure 

The SUIF compiler system was recently chosen to be 
part of the National Compiler Infrastructure (NCI) 
project funded by the Defense Advanced Research 
Projects Agency (DARPA) and the National Science 
Foundation (NSF). The goal of the project is to 
develop a common compiler platform for researchers 
and to facilitate technology transfer to industry. The 



SUIF component of the NCI project is the result of the 
collaboration among researchers in five universities 
(Harvard University, Massachusetts Institute of 
Technology, Rice University, Stanford University, 
University of California at Santa Barbara) and one 
industrial partner, Portland Group Inc. Compaq is a 
corporate sponsor of the project and is providing the 
FORTRAN front end. 

A revised version of the SUIF infrastructure (SUIF 
2.0) is being released as part of the SUIF NCI project 
(a preliminary version of SUIF 2.0 is available at the 
SUIF web site). The completed system will be 
enhanced to support parallelization, interprocedural
analysis, memory hierarchy optimizations, object-
oriented programming, scalar optimizations, and
machine-dependent optimizations. An overview of 
the SUIF NCI system is shown in Figure A1. See
www-suif.stanford.edu/suif/NCI/suif.html for more
information about SUIF and the NCI project, includ- 
ing a complete list of optimizations and a schedule. 

Effective user debugging of optimized code has
been a topic of theoretical and practical interest 
in the software development community for 
almost two decades, yet today the state of the 
art is still highly uneven. We present a brief sur- 
vey of the literature and current practice that 
leads to the identification of three aspects of 
debugging optimized code that seem to be 
critical as well as tractable without extraordi- 
nary efforts. These aspects are (1) split lifetime 
support for variables whose allocation varies 
within a program combined with definition 
point reporting for currency determination, 
(2) stepping and setting breakpoints based on 
a semantic event characterization of program 
behavior, and (3) treatment of inlined routine 
calls in a manner that makes inlining largely 
transparent. We describe the realization of 
these capabilities as part of Compaq's GEM 
back-end compiler technology and the debug- 
ging component of the OpenVMS Alpha oper- 
ating system. 



Introduction 

In software development, it is common practice to 
debug a program that has been compiled with little or 
no optimization applied. The generated code closely 
corresponds to the source and is readily described by a 
simple and straightforward debugging symbol table. A 
debugger can interpret and control execution of the 
code in a fashion close to the user's source-level view 
of the program. 

Sometimes, however, developers find it necessary or 
desirable to debug an optimized version of the pro- 
gram. For instance, a bug — whether a compiler bug or 
incorrect source code — may only reveal itself when 
optimization is applied. In other cases, the resource 
constraints may not allow the unoptimized form to be 
used because the code is too big and/or too slow. Or, 
the developer may need to start analysis using the 
remains, such as a core file, of the failed program,
whether or not this code has been optimized. Whatever 
the reason, debugging optimized code is harder than 
debugging unoptimized code — much harder — because 
optimization can greatly complicate the relationship 
between the source program and the generated code. 

Zellweger 1 introduced the terms expected behavior 
and truthful behavior when referring to debugging 
optimized code. A debugger provides expected behav- 
ior if it provides the behavior a user would experience 
when debugging an unoptimized version of a pro- 
gram. Since achieving that behavior is often not possi- 
ble, a secondary goal is to provide at least truthful 
behavior, that is, to never lie to or mislead a user. In 
our experience, even truthful behavior can be chal- 
lenging to achieve, but it can be closely approached. 

This paper describes three improvements made to 
Compaq's GEM back-end compiler system and to 
OpenVMS DEBUG, the debugging component of the 
OpenVMS Alpha operating system. These improve- 
ments address 

1. Split lifetime variables and currency determination 

2. Semantic events 

3. Inlining 

Before presenting the details of this work, we dis-
cuss the alternative approaches to debugging optimized
code that we considered, the state of the art, and the 
operating strategies we adopted. 

Alternative Approaches 

Various approaches have been explored to improve 
the ability to debug optimized code. They include 
the following: 

■ Enhance debugger analysis 

■ Limit optimization 

■ Limit debugging to preplanned locations 

■ Dynamically deoptimize as needed 

■ Exploit an associated program database 

We touch on these approaches in turn. 

In probably the oldest theoretical analysis that 
supports debugging optimized code, Hennessy 2 stud- 
ies whether the value displayed for a variable is current, 
that is, the expected value for that variable at a given 
point in the program. The value displayed might not 
be current because, for example, assignment of a later 
value has been moved forward or the relevant assign- 
ment has been delayed or omitted. Hennessy postu- 
lates that a flow graph description of a program is 
communicated to the debugger, which then solves 
certain flow analysis equations in response to debug 
commands to determine currency as needed. 
Copperman 3 takes a similar though much more gen- 
eral approach. Conversely, commercial implementa- 
tions have favored more complete preprocessing of 
information in the compiler to enable simpler debug- 
ger mechanisms. 4 6 

If optimization is the "problem," then one approach 
to solving the problem is to limit optimization to only 
those kinds that are actually supported in an available 
debugger. Zurawski 7 develops the notion of a recovery 
function that matches each kind of optimization. As an 
optimization is applied during compilation, the com- 
pensating recovery function is also created and made 
available for later use by a debugger. If such a recovery 
function cannot be created, then the optimization is 
omitted. Unfortunately, code-motion-related optimi- 
zations generally lack recovery functions and so must 
be foregone. Taking this approach to the extreme 
converges with traditional practice, which is simply to 
disable all optimization and debug a completely unop- 
timized program. 

If full debugger functionality need only be provided 
at some locations, then some debugger capabilities can 
be provided more easily. Zurawski 7 also employed this 
idea to make it easier to construct appropriate recov- 
ery functions. This approach builds on a language- 
dependent concept of inspection points, which 



generally must include all call sites and may corre- 
spond to most statement boundaries. His experience 
suggests, however, that even limiting inspection points 
to statement boundaries severely limits almost all kinds 
of optimization. 

Hölzle et al. 8 describe techniques to dynamically
deoptimize part of a program (replace optimized code 
with its unoptimized equivalent) during debugging to 
enable a debugger to perform requested actions. They 
make the technique more tractable, in part by delaying 
asynchronous events to well-defined interruption 
points, generally backward branches and calls. Opti- 
mization between interruption points is unrestricted. 
However, even this choice of interruption points
severely limits most code motion and many other 
global optimizations. 

Pollock and others 9, 10 use a different kind of deopti-
mization, which might be called preplanned, incre-
mental deoptimization. During a debugging session,
any debugging requests that cannot be honored 
because of optimization effects are remembered so 
that a subsequent compilation can create an exe- 
cutable that can honor these requests. This scheme is 
supported by an incremental optimizer that uses a pro- 
gram database to provide rapid and smooth forward 
information flow to subsequent debugging sessions. 

Feiler 11 uses a program database to achieve the bene-
fits of interactive debugging while applying as much
static compilation technology as possible. He describes 
techniques for maintaining consistency between the 
primary tree-based representation and a derivative 
compiled form of the program in the face of both
debugging actions and program modifications on-the- 
fly. While he appears to demonstrate that more is possi- 
ble than might be expected, substantial limitations still 
exist on debugging capability, optimization, or both. 

A comprehensive introduction and overview to these 
and other approaches can be found in Copperman 3 and 
Adl-Tabatabai. 12 In addition, "An Annotated Biblio-
graphy on Debugging Optimized Code" is available
separately on the Digital Technical Journal web site at 
http://www.digital.com/info/DTJ. This bibliography 
cites and summarizes the entire literature on debugging
optimized code as best we know it. 

State of the Art 

When we began our work in early 1994, we assessed 
the level of support for debugging optimized code 
that was available with competitive compilers. Because 
we have not updated this assessment, it is not appro- 
priate for us to report the results here in detail. We do 
however summarize the methodology used and the 
main results, which we believe remain generally valid. 

We created a series of example programs that pro- 
vide opportunities for optimization of a particular kind 
or of related kinds, and which could lead a traditional
debugger to deviate from expected behavior. We com- 
piled and executed these programs under the control 
of each system's debugger and recorded how the sys- 
tem handled the various kinds of optimization. The 
range of observed behaviors was diverse. 

At one extreme were compilers that automatically 
disable all optimization if a debugging symbol table is 
requested (or, equivalently for our purposes, give an 
error if both optimization and a debugging symbol 
table are requested). For these compilers, the whole 
exercise becomes moot; that is, attempting to debug 
optimized code is not allowed. 

Some compiler/debugger combinations appeared 
to usefully support some of our test cases, although 
none handled all of them correctly. In particular, none 
seemed able to show a traceback of subroutine calls 
that compensated for inlining of routine calls and all 
seemed to produce a lot of jitter when stepping by line 
on systems where code is highly scheduled. 

The worst example that we found allowed compila- 
tion using optimization but produced a debugging 
symbol table that did not reflect the results of that opti- 
mization. For example, local variables were described 
as allocated on the stack even though the generated 
code clearly used registers for these variables and never
accessed any stack locations. At debug time, a request
to examine such a variable resulted in the display of the
irrelevant and never-accessed stack locations. 
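
One way a symbol table can avoid this failure is to record, for each variable, the code ranges over which each location actually holds it. The following Python sketch is a hypothetical illustration of that idea only; the record layout, the addresses, and the location names are invented for the example and do not describe the symbol table format of any compiler or debugger discussed here.

```python
def locate(variable_ranges, pc):
    """Return where a variable lives at program counter pc.

    variable_ranges is a list of (start_pc, end_pc, location) records,
    with end_pc exclusive.  None means the variable has no location at
    pc, which lets a debugger say so instead of showing stale storage.
    """
    for start, end, location in variable_ranges:
        if start <= pc < end:
            return location
    return None


# Hypothetical records for one local variable (all values are made up).
x_ranges = [
    (0x100, 0x120, "R3"),
    (0x120, 0x140, "stack+8"),
    (0x150, 0x160, "R5"),
]
```

With these records, a request to examine the variable at address 0x110 is answered from register R3, and a request at 0x148 is reported as unavailable rather than answered from an unrelated stack slot.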

The bottom line from this analysis was very clear: 
the state of the art for support of debugging opti-
mized code was generally quite poor. DIGITAL's
debuggers, including OpenVMS DEBUG, were not 
unusual in this regard. The analysis did indicate some 
good examples, though. Both the CONVEX CXdb 4, 5
and the HP 9000 DOC 6 systems provide many valu- 
able capabilities. 

Biases and Goals 

Early in our work, we adopted the following strategies: 

■ Do not limit or compromise optimization in any way. 

■ Stay within the framework of the traditional edit- 
compile-link-debug cycle. 

■ Keep the burden of analysis within the compiler. 

The prime directive for Compaq's GEM-based 
compilers is to achieve the highest possible perfor-
mance from the Alpha architecture and chip technol-
ogy. Any improvements in debugging such optimized
code should be useful in the face of the best that a 
compiler has to offer. Conversely, if a programmer has 
the luxury of preparing a less optimized version for 
debugging purposes, there is little or no reason for 
that version to be anything other than completely 



unoptimized. There seems to be no particular benefit 
to creating a special intermediate level of combined 
debugger/optimization support. 

Pragmatically, we did not have the time or staffing 
to develop a new optimization framework, for exam- 
ple, based on some kind of program database. Nor 
were we interested in intruding into those parts of the 
GEM compiler that performed optimization to create 
more complicated options and variations, which might 
be needed for dynamic deoptimization or recovery 
function creation. 

Finally, it seemed sensible to perform most analysis 
activities within the compiler, where the most complete 
information about the program is already available. It is 
conceivable that passing additional information from 
the compiler to the debugger using the object file 
debugging symbol table might eventually tip the bal- 
ance toward performing more analysis in the debugger 
proper. The available size data (presented later in this 
paper in Table 3) do not indicate this. 

We identified three areas in which we felt enhanced 
capabilities would significantly improve support for 
debugging optimized code. These areas are 

1. The handling of split lifetime variables and currency 
determination 

2. The process of stepping through the program 

3. The handling of procedure inlining 

In the following sections we present the capabilities we 
developed in each of these areas together with insight 
into the implementation techniques employed. 

First, we review the GEM and OpenVMS DEBUG 
framework in which we worked. The next three sec- 
tions address the new capabilities in turn. The last 
major section explores the resource costs (compile- 
time size and performance, and object and image 
sizes) needed to realize these capabilities. 

Starting Framework 

Compaq's GEM compiler system and the OpenVMS 
DEBUG component of the OpenVMS operating 
system provide the framework for our work. A brief 
description of each follows. 

GEM 

The GEM compiler system 15 is the technology 
Compaq is using to build state-of-the-art compiler 
products for a variety of languages and hardware and 
software platforms. The GEM system supports a range 
of languages (C, C++, FORTRAN including HPF, 
Pascal, Ada, COBOL, BLISS, and others) and has been 
successfully retargeted and rehosted for the Alpha, 
MIPS, and Intel IA-32 architectures and for the 
OpenVMS, DIGITAL UNIX, Windows NT, and 
Windows 95 operating systems. 

The major components of a GEM compiler are the 
front end, the optimizer, the code generator, the final 
code stream optimizer, and the compiler shell. 

■ The front end performs lexical analysis and parsing 
of the source program. The primary outputs are 
intermediate language (IL) graphs and symbol 
tables. Front ends for all source languages translate 
to the same common representation. 

■ The optimizer transforms the IL generated by the 
front end into a semantically equivalent form that 
will execute faster on the target machine. A signifi- 
cant technical achievement is that a single optimizer 
is used for all languages and target platforms. 

■ The code generator translates the IL into a list of 
code cells, each of which represents one machine 
instruction for the target hardware. Virtually all the 
target machine instruction-specific code is encapsu- 
lated in the code generator. 

■ The final phase performs pattern-based peephole 
optimizations followed by instruction scheduling. 

■ The shell is a portable interface to the external envi- 
ronment in which the compiler is used. It provides 
common compiler functions such as listing genera- 
tors, object file emitters, and command line proces- 
sors in a form that allows the other components to 
remain independent of the operating system. 

The bulk of the GEM implementation work described 
in this paper occurs at the boundary between the final 
phase and the object file output portion of the shell. A 
new debugging optimized code analysis phase exam- 
ines the generated code stream representation of the 
program, together with the compiler symbol table, to 
extract the information necessary to pass on to a 
debugger through the debugging symbol table. Most 
of the implementation is readily adapted to different 
target architectures by means of the same instruction 
property tables that are used in the code generator and 
final optimizer. 

OpenVMS DEBUG 

The OpenVMS Alpha debugger, originally developed 
for the OpenVMS VAX system, is a full-function, 
source-level, symbolic debugger. It supports symbolic 
debugging of programs written in BLISS, MACRO-32, 
MACRO-64, FORTRAN, Ada, C, C++, Pascal, PL/I, 
BASIC, and COBOL. The debugger allows the user to 
control the execution and to examine the state of a 
program. Users can 

■ Set breakpoints to stop at certain points in the program 

■ Step through the execution of the program a line at 
a time 



■ Display the source-level view of the program's exe- 
cution using either a graphical user interface or a 
character-based user interface 

■ Examine user variables and hardware registers 

■ Display a stack traceback showing the current call 
stack 

■ Set watch points 

■ Perform many other functions 

Split Lifetime Variables and Currency 
Determination 

Displaying (printing) the value of a program variable is 
one of the most basic services that a debugger can pro- 
vide. For unoptimized code and traditional debug- 
gers, the mechanisms for doing this are generally 
based on several assumptions. 

1. A variable has a single allocation that remains fixed 
throughout its lifetime. For a local or a stack-allocated 
variable that means throughout the lifetime of the 
scope in which the variable is declared. 

2. Definitions and uses of the values of user variables 
occur in the same order in the generated code as 
they do in the original program source. 

3. The set of instructions that belong to a given scope 
(which may be a routine body) can be described by 
a single contiguous range of addresses. 

The first and second assumptions are of interest in this 
discussion because many GEM optimizations make 
them inappropriate. Split lifetime optimization (dis- 
cussed later in this section) leads to violation of the first 
assumption. Code motion optimization leads to viola- 
tion of the second assumption and thereby creates the 
so-called currency problem. We treat both of these prob- 
lems together, and we refer to them collectively as split 
lifetime support. Statement and instruction scheduling 
optimization leads to violation of the third assumption. 
This topic is addressed later, in the section Inlining. 

Split Lifetime Variable Definition 

A variable is said to have split lifetimes if the set of 
fetches and stores of the variable can be partitioned 
such that none of the values stored in one subset are 
ever fetched in another subset. When such a partition 
exists, the variable can be "split" into several indepen- 
dent "child" variables, each corresponding to a parti- 
tion. As independent variables, the child variables can 
be allocated independently. The effect is that the 
original variable can be thought to reside in different 
locations at different points in time — sometimes in a 
register, sometimes in memory, and sometimes 
nowhere at all. Indeed, it is even possible for the differ- 
ent child variables to be active simultaneously. 
Split Lifetime Example A simple example of a split 
lifetime variable can be seen in the following straight- 
line code fragment: 

    A = ...;        ! Define (assign value to) A 
    B = ...A...;    ! Use definition (value of) A 
    A = ...;        ! Define A again 
    C = ...A...;    ! Use latter definition A 



In this example, the first value assigned to variable A is 
used later in the assignment to variable B and then 
never used again. A new value is assigned to A and 
used in the assignment to variable C. 

Without changing the meaning of this fragment, we 
can rewrite the code as 

    A1 = ...;       ! Define A1 
    B = ...A1...;   ! Use A1 
    A2 = ...;       ! Define A2 
    C = ...A2...;   ! Use A2 

where variables A1 and A2 are split child variables of A. 

Because A1 and A2 are independent, the following 
is also an equivalent fragment: 

    A1 = ...;       ! Define A1 
    A2 = ...;       ! Define A2 
    B = ...A1...;   ! Use A1 
    C = ...A2...;   ! Use A2 

Here, we see that the value of A2 is assigned while the 
value of A1 is still alive. That is, the split children of a 
single variable have overlapping lifetimes. 

This example illustrates that split lifetime optimi- 
zation is possible even in simple straight-line code. 
Moreover, other optimizations can create opportuni- 
ties for split lifetime optimization that may not be 
apparent from casual examination of the original 
source. In particular, loop unrolling (in which the 
body of a loop is replicated several times in a row) 
can create loop bodies for which split lifetime opti- 
mization is feasible and desirable. 
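
The partitioning idea can be illustrated with a minimal sketch that handles straight-line code only. It is not the GEM implementation: the function name `split_children` and the event encoding are hypothetical, chosen for illustration, and code with branches would additionally require merging the definitions that reach a common fetch, which the sketch does not attempt.

```python
def split_children(events, var):
    """Partition the stores and fetches of `var` in straight-line code.

    `events` is a list of (operation, name) pairs in program order, where
    operation is "store" or "fetch". Every store starts a new child
    variable; every fetch joins the child started by the latest store,
    because that store supplies the only value the fetch can observe.
    Returns one list of event indexes per child.
    """
    children = []
    for index, (operation, name) in enumerate(events):
        if name != var:
            continue
        if operation == "store":
            children.append([index])
        elif children:  # a fetch with no earlier store has no child here
            children[-1].append(index)
    return children


# The straight-line fragment from the text: A is defined, used in B,
# defined again, and used in C.
fragment = [
    ("store", "A"), ("fetch", "A"), ("store", "B"),
    ("store", "A"), ("fetch", "A"), ("store", "C"),
]
print(split_children(fragment, "A"))  # [[0, 1], [3, 4]]
```

The two index lists correspond to the child variables A1 and A2 of the example; since neither child fetches a value stored by the other, each can be allocated independently.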

Variables of Interest Our implementation deals only 
with scalar variables and parameters. This includes 
Alpha's extended precision floating-point (128-bit 
X_Floating) variables as well as variables of any of the 
complex types (see Sites' 6 ). These latter variables are 
referred to as two-part variables because each requires 
two registers to hold its value. 

Currency Definition 

The value of a variable in an optimized program is cur- 
rent with respect to a given position in the source pro- 
gram if the variable holds the value that would be 
expected in an unoptimized version of the program. 
Several kinds of optimization can lead to noncurrent 
variables. Consider the currency example in Figure 1. 

As shown in Figure 1, the optimizing compiler has 
chosen to change the order of operations so that line 4 
is executed prior to line 3. Now suppose that execu- 
tion has stopped at the instruction in line 3 of the 
unoptimized code, the line that assigns a value to vari- 
able C. 

Given a request to display (print) the value of A, 
a traditional debugger will display whatever value 
happens to be contained in the location of A, which 
here, in the optimized code, happens to be the result 
of the second assignment to A. This displayed value 
of A is a correct value, but it is not the expected 
value that should be displayed at line 3. This scenario 
might easily mislead a user into a frustrating and 
fruitless attempt to determine how the assignment 
in line 1 is computing and assigning the wrong 
value. The problem occurs because the compiler has 
moved the second assignment so that it is early rela- 
tive to line 3. 

Another currency example can be seen in the frag- 
ment (taken from Copperman') that appears in Figure 
2. In this case, the optimizing compiler has chosen to 
omit the second assignment to variable A and to assign 
that value directly into the actual parameter location 
used for the call of routine FOO. Suppose that the 
debugger is stopped at the call of routine FOO. Given 
a request to display A, a traditional debugger is likely 
to display the result of the first assignment to A. Again, 
this value is an actual value of A, but it is not the 
expected value. 

Alternatively, it is possible that prior to reaching the 
call, the optimizing compiler has decided to reuse the 
location that originally held the first value of A for 
another purpose. In this case, no value of A is available 
to display at the call of routine FOO. 



Line  Unoptimized                                    Optimized 

1     A = ...;        Define A                       A = ...; 
2     B = ...A...;    Use A                          B = ...A...; 
3     C = ...;        C does not depend on A         A = ...; 
4     A = ...;        Define A                       C = ...; 
5     D = ...A...;    Use second definition of A     D = ...A...; 

Figure 1 

Currency Example 1 



Line  Unoptimized                                    Optimized 

1     A = expression1;                               A = expression1; 
2     B = ...A...;        Use 1st def. of A          B = ...A...; 
3     A = expression2; 
4     FOO(A);             Use 2nd def. of A          FOO(expression2); 

Figure 2 

Currency Example 2 

Finally, consider the example shown in Figure 3, 
which illustrates that the currency of a variable is not a 
property that is invariant over time. Suppose that exe- 
cution is stopped at line 5, inside the loop. In this case, 
A is not current during the first time through the loop 
body because the actual value comes from line 3 
(moved from inside the loop); it should come from 
line 1. On subsequent times through the loop, the 
value from line 3 is the expected value, and the value of 
A is current. 

As discussed earlier, most approaches to currency 
determination involve making certain kinds of flow 
graph and compiler optimization information avail- 
able to the debugger so that it can report when a dis- 
played value is not current. However, we wanted to 
avoid adding major new kinds of analysis capability to 
DIGITAL's debuggers. 

More fundamentally, as the degree of optimization 
increases, the notion of current position in the program 
itself becomes increasingly ambiguous. Even when the 
particular instruction at which execution is pending can 
be clearly and unequivocally related to a particular source 
location, this location is not automatically the best one to 
use for currency determination. Nevertheless, the source 
location (or set of locations) where a displayed value was 
assigned can be reliably reported without needing to 
establish the current position. 

Accordingly, we use an approach different from 
those considered in the literature. We use a straight- 
forward flow analysis formulation to determine what 
locations hold values of user variables at any given 
point in the program and combine this with the set of 
definition locations that provide those values. Because 
there may be more than one source location, the user 
is given the basic information to determine where in 
the source the value of a variable may have originated. 
Consequently, the user can determine whether the 
value displayed is appropriate for his or her purpose. 

Compiler Processing 

A compiler performs most split lifetime analysis on a 
routine-by-routine basis. A preliminary walk over the 
entire symbol table identifies the variable symbols that 
are of interest for further analysis. Then, for each rou- 
tine, the compiler performs the following steps: 

■ Code cell prepass 

■ Flow graph construction 

■ Basic block processing 

■ Parameter processing 

■ Backward propagation 

■ Forward propagation 

■ Information promotion and cleanup 

After the compiler completes this processing for 
all routines, a symbol table postwalk performs final 
cleanup tasks. The following contains a brief discus- 
sion of these steps. 

In this summary, we highlight only the main charac- 
teristics of general interest. In particular, we assume that 
each location, such as a register, is independent of all 
other locations. This assumption is not appropriate to 
locations on the stack because variables of different sizes 
may overlay each other. The complexity of dealing with 
overlapping allocations is beyond the scope of this paper. 



Line  Unoptimized                                  Optimized 

1     A = ...;                                     A = ...; 
2     ...A...;                                     ...A...; 
3                                                  A = ...; 
4     while (...) {                                while (...) { 
5         ...                                          ... 
6         A = ...;    // A is loop invariant 
7     }                                            } 

Figure 3 

Currency Example 3 

Of special importance in this processing is the fact 
that each operand of every instruction includes a base 
symbol field that refers to the compiler's symbol table 
entry for the entity that is involved. 

Symbol Table Prewalk The symbol table prewalk 
identifies the variables of interest for analysis. As dis- 
cussed, we are interested in scalars corresponding to 
user variables (not compiler-created temporaries), 
including Alpha's extended precision floating-point 
(128-bit X_Floating) and complex values. 

DIGITAL's FORTRAN implementations pass para- 
meters using a by-reference mechanism with bind 
(rather than copy-in/copy-out) semantics. GEM treats 
the hidden reference value as a variable that is subject 
to split lifetime optimization. Since the reference vari- 
able must be available to effect operations on the logi- 
cal parameter variable, it follows that both the abstract 
parameter and its reference value must be treated as 
interesting variables. 

Code Cell Prepass The code cell prepass performs a 
single walk over all code cells to determine 

■ The maximum and minimum offsets in the stack 
frame that hold any interesting variables 

■ The highest numbered register that is actually refer- 
enced by the code 

■ Whether the stack frame uses a frame pointer that is 
separate from the stack pointer 

The compiler uses these characteristics to preallocate 
various working storage areas. 

Flow Graph Construction A flow graph is built, in 
which each basic block is a node of the graph. 

Basic Block Processing Basic block processing per- 
forms a kind of symbolic execution of the instructions 
of each block, keeping track of the effect on machine 
state as execution progresses. 

When an instruction operand writes to a location 
with a base symbol that indicates an interesting vari- 
able, the compiler updates the location description to 
indicate that the variable is now known to reside in 
that location — this begins a lifetime segment. The 
instruction that assigned the value is also recorded 
with the lifetime segment. 

If there was previously a known variable in that loca- 
tion, that lifetime segment is ended (even if it was for 
the same variable). The beginning and ending instruc- 
tions for that segment are then recorded with the vari- 
able in the symbol table. 



When an instruction reads an operand with a base 
symbol that indicates an interesting variable, some 
more unusual processing applies. 

If the variable being read is already known to 
occupy that location, then no further processing is 
required. This is the most common case. 

If the location already contains some other known 
variable, then the variable being read is added to the 
set of variables for that location. This situation can 
arise when there is an assignment of one variable to 
another and the register allocator arranges to allocate 
them both to the same location. As a result, the assign- 
ment happens implicitly. 

If the location does not contain a known variable 
but there is a write operation to that location earlier in 
the same block (a fact that is available from the loca- 
tion description), the prior write is retroactively 
treated as though it did write that variable at the earlier 
instruction. This situation can arise when the result of 
a function call is assigned to a variable and the register 
allocator arranges to allocate that variable in the regis- 
ter where the call returns its value. The code cell repre- 
sentation for the call contains nothing that indicates a 
write to the variable; all that is known is that the return 
value location is written as a result of the call. Only 
when a later code cell indicates that it is using the value 
of a known variable from that location can we infer 
more of what actually happened. 

If the location does not contain a known variable and 
there is no write to that same location earlier in this 
same basic block, then the defining instruction cannot 
be immediately determined. A location description is 
created for the beginning of the basic block indicating 
that the given variable or set of variables must have 
been defined in some predecessor block. Of course, the 
contents known as a result of the read operation can 
also propagate forward toward the end of the block, 
just as for any other read or write operation. 

Special care is needed to deal with a two-part variable. 
Such a variable does not become defined until both 
instructions that assign the value have been encoun- 
tered. Similarly, any reuse of either of the two locations 
ends the lifetime segment of the variable as a whole. 

At the end of basic block processing, location 
descriptions specify what is known about the contents 
of each location as a result of read and write operations 
that occurred in the block. This description indicates 
the set of variables that occupy the location, or that the 
location was last written by some value that is not the 
value of a user variable, or that the location does not 
change during execution of the block. 
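
The read and write cases described above can be sketched as a small symbolic executor. This is an illustration under simplifying assumptions, not the GEM code: `process_block`, the tuple layout of an instruction, and the returned structures are hypothetical, each location is treated as independent, and two-part variables are ignored.

```python
def process_block(instructions, interesting):
    """Symbolically execute one basic block.

    Each instruction is (address, kind, location, symbol): kind is
    "write" or "read", and symbol is the operand's base symbol (or None).
    Returns (segments, entry_needs, state):
      segments    - closed lifetime segments (variables, location, def, end)
      entry_needs - location -> variables that a predecessor must define
      state       - location description at the end of the block
    """
    state = {}        # location -> {"vars": set, "def": address or None}
    last_write = {}   # location -> address of latest write in this block
    segments = []
    entry_needs = {}

    for address, kind, location, symbol in instructions:
        is_variable = symbol in interesting
        if kind == "write":
            if location in state:
                # Any write ends the segment that occupied the location,
                # even when the same variable is being written again.
                old = state.pop(location)
                segments.append(
                    (frozenset(old["vars"]), location, old["def"], address))
            last_write[location] = address
            if is_variable:
                state[location] = {"vars": {symbol}, "def": address}
        elif is_variable:
            if location in state:
                # Already known here (no change), or a second variable
                # sharing the location through an implicit assignment.
                state[location]["vars"].add(symbol)
                if state[location]["def"] is None:
                    entry_needs[location].add(symbol)
            elif location in last_write:
                # Retroactively treat the earlier write as the definition.
                state[location] = {"vars": {symbol},
                                   "def": last_write[location]}
            else:
                # Defined in some predecessor block.
                entry_needs[location] = {symbol}
                state[location] = {"vars": {symbol}, "def": None}
    return segments, entry_needs, state


block = [
    (0, "write", "R0", None),   # call result arrives in R0
    (4, "read", "R0", "x"),     # reveals that x lives in R0 since address 0
    (8, "write", "R1", "y"),
    (12, "read", "R1", "z"),    # z shares R1 with y
    (16, "write", "R1", "y"),   # ends the {y, z} segment
    (20, "read", "R2", "w"),    # w must come from a predecessor
]
segments, entry_needs, state = process_block(block, {"x", "y", "z", "w"})
```

For this block, the only closed segment is the one for y and z in R1 (addresses 8 to 16), x is recorded as defined at address 0, and w is left for the propagation steps to resolve.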

Parameter Processing The compiler models parame- 
ters as locations that are defined with the contents of a 
known variable at the entry point of a routine. 
Backward Propagation Backward propagation iter- 
ates over the flow graph and uses the locations with 
known contents at the beginning of a block to work 
backward to predecessor blocks looking for instruc- 
tions that write to that location. For each variable in 
each input location, any such prior write instruction is 
retroactively made to look like a definition of the vari- 
able. Note that this propagation is not a flow algo- 
rithm because no convergence criterion is involved; it is 
simply a kind of spanning walk. 
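
A minimal sketch of such a spanning walk follows. The data layout is an assumption made for illustration (`preds`, `last_write`, and `entry_needs` are hypothetical names, as is `backward_propagate`): each block records the address of its last write to each location, and the walk stops along a path as soon as it finds a writing block.

```python
def backward_propagate(preds, last_write, entry_needs):
    """Find retroactive definitions for variables needed at block entry.

    preds       - block -> list of predecessor blocks
    last_write  - block -> {location: address of the last write there}
    entry_needs - block -> {location: set of variables expected at entry}
    Returns {(variable, location): set of defining instruction addresses}.
    """
    definitions = {}
    for block, needs in entry_needs.items():
        for location, variables in needs.items():
            found = set()
            visited = set()
            stack = list(preds.get(block, []))
            while stack:
                pred = stack.pop()
                if pred in visited:
                    continue          # each block is visited once; loops end
                visited.add(pred)
                address = last_write.get(pred, {}).get(location)
                if address is not None:
                    found.add(address)    # this write becomes a definition
                else:
                    stack.extend(preds.get(pred, []))
            for variable in variables:
                definitions.setdefault((variable, location), set()).update(found)
    return definitions


# A diamond-shaped flow graph: B0 -> B1 -> B3 and B0 -> B2 -> B3.
preds = {"B0": [], "B1": ["B0"], "B2": ["B0"], "B3": ["B1", "B2"]}
last_write = {"B0": {"R2": 8}, "B1": {"R2": 40}}
entry_needs = {"B3": {"R2": {"w"}}}
print(backward_propagate(preds, last_write, entry_needs))
```

Because every block is visited at most once per query, the walk terminates without any convergence test, which is the sense in which it is not a flow algorithm.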

Forward Propagation Forward propagation iterates 
over the flow graph and uses the locations with known 
contents at the end of each block to work forward to 
successor blocks to provide known contents at the 
beginning of other blocks. This is a classic "reaching 
definitions" flow algorithm, in which the input state of 
a location for a block is the intersection of the known 
contents from the predecessors. 

In our case, the compiler also propagates definition 
points, which are the addresses of the instructions that 
begin the lifetime segments. For those variables that are 
known to occupy a location, the set of definitions is the 
union of all the definitions that flow into that location. 
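
The combination of intersection (for contents) and union (for definition points) can be sketched as an iterative flow computation. This is an assumption-laden illustration rather than the GEM algorithm: `forward_propagate`, the `gen` table, and the optimistic treatment of predecessors that have not yet been computed are all choices made for the sketch.

```python
def forward_propagate(order, preds, gen):
    """Compute the known contents at the start of every block.

    order - blocks in a fixed iteration order
    preds - block -> list of predecessor blocks
    gen   - block -> {location: (variables, definitions)} describing the
            block's own effect at its end; an empty variable set means the
            location was overwritten by a non-variable value
    Returns block -> {location: (variables, definitions)} at block entry.
    """
    out = {block: None for block in order}    # None = not yet computed
    inputs = {block: {} for block in order}
    changed = True
    while changed:
        changed = False
        for block in order:
            known = [out[p] for p in preds[block] if out[p] is not None]
            state = {}
            if known:
                shared = set.intersection(*(set(k) for k in known))
                for location in shared:
                    # Contents: only variables present on every path.
                    variables = frozenset.intersection(
                        *(k[location][0] for k in known))
                    if variables:
                        # Definition points: everything that flows in.
                        definitions = frozenset().union(
                            *(k[location][1] for k in known))
                        state[location] = (variables, definitions)
            inputs[block] = state
            new_out = dict(state)
            new_out.update(gen[block])        # the block's own effects win
            if new_out != out[block]:
                out[block] = new_out
                changed = True
    return inputs


j = frozenset({"j"})
preds = {"B0": [], "B1": ["B0"], "B2": ["B0"], "B3": ["B1", "B2"]}
gen = {
    "B0": {"R1": (j, frozenset({390}))},   # j defined before the branch
    "B1": {"R1": (j, frozenset({394}))},   # j redefined on one path only
    "B2": {},
    "B3": {},
}
inputs = forward_propagate(["B0", "B1", "B2", "B3"], preds, gen)
print(inputs["B3"])
```

At the join block B3, register R1 is still known to hold j because both paths agree on the contents, and the definition set contains both 390 and 394. Had one path left a different variable in R1, the intersection would be empty and no contents would be reported for R1 at B3.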

Information Promotion and Cleanup The final step of 
compiler processing is to combine information for adja- 
cent blocks where possible. This action saves space in the 
debugging symbol table but does not affect the accuracy 
of the description. Descriptions for by-reference bind 
parameters are next merged with the descriptions for the 
associated reference variables. Finally, lifetime segment 
information not already associated with symbol table 
entries is copied back. 
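
The space-saving merge can be pictured with the following sketch. It assumes, for illustration only, that a segment is a tuple `(start, end, location, definitions)` with an exclusive `end` address; `merge_adjacent` is a hypothetical name and the real criteria for combining descriptions may differ.

```python
def merge_adjacent(segments):
    """Merge address-adjacent lifetime segments that agree on both the
    location and the set of definition points."""
    merged = []
    for start, end, location, definitions in sorted(segments, key=lambda s: s[0]):
        if merged:
            last_start, last_end, last_location, last_definitions = merged[-1]
            if (last_end == start and last_location == location
                    and last_definitions == definitions):
                # Extend the previous segment instead of adding a new one.
                merged[-1] = (last_start, end, location, definitions)
                continue
        merged.append((start, end, location, definitions))
    return merged


d = frozenset({8})
print(merge_adjacent([(0, 16, "R1", d), (16, 32, "R1", d), (32, 48, "R2", d)]))
```

The first two segments collapse into one covering addresses 0 to 32, while the third stays separate because its location differs; the description is smaller but says exactly the same thing.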

Object File Representation 

The object file debugging symbol table representation 
for split lifetime variables is actually quite simple. 
Instead of a single address for a variable, there is a 
sequence of lifetime segment descriptions. Each life- 
time segment consists of 

■ The range of addresses over which the child loca- 
tion applies 

■ The location (in a register, at a certain offset in the 
current stack frame, indirect through a register or 
stack location, etc.) 

■ The set of addresses that provide definitions for this 
lifetime segment 

By convention, the last segment in the sequence can 
have the address range 0 to FFFFFFFF (hex). This 
address range is used for a static variable, for example 
in a FORTRAN COMMON block, that has a default allo- 
cation that applies whenever no active children exist. 



Debugger Processing 

Name resolution, that is, binding a textual name to the 
appropriate entry in the debug symbol table, is in no 
way affected by whether or not a variable has split life- 
time segments. After the symbol table entry is found, 
any sequence of lifetime segments is searched for one 
that includes the current point of execution indicated 
by the program counter (PC). If found, the location of 
the value is taken from that segment. Otherwise, the 
value of the variable is not available. 
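
The lookup just described amounts to a linear search, sketched below. The tuple layout `(low, high, location, definitions)` with inclusive bounds and the name `locate` are assumptions for illustration; the actual debugging symbol table encoding is not shown in this paper.

```python
def locate(segments, pc):
    """Return (location, definitions) for the first lifetime segment whose
    address range contains `pc`, or None if the variable has no value
    at that point of execution."""
    for low, high, location, definitions in segments:
        if low <= pc <= high:
            return location, definitions   # first match is the only match
    return None


# A local variable with two split children and no default allocation.
local_var = [
    (0x100, 0x11F, "R3", (0x0FC,)),
    (0x140, 0x15F, "FP+16", (0x13C,)),
]
# A static variable whose last segment is the catch-all default range.
static_var = [
    (0x100, 0x11F, "R5", (0x0FC,)),
    (0x0, 0xFFFFFFFF, "static storage", ()),
]

print(locate(local_var, 0x110))    # inside the first child
print(locate(local_var, 0x130))    # between children: no value here
print(locate(static_var, 0x130))   # falls through to the default range
```

Placing the catch-all range last matters because the search returns the first match: an active child must be found before the default allocation is considered.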

Usage Example 

To illustrate how a user sees the results of this processing, 
consider the small C program in Figure 4. Note that the 
numbers in the left column are listing line numbers. 

When DOCT8 is compiled, linked, and executed 
under debugger control, the dialogue shown in Figure 5 
appears. The figure also includes interpretive comments. 

Known Limitations 

The following limitations apply to the existing split 
lifetime support. 

Multiple Active Split Children While the compiler 
analysis correctly determines multiple active split child 
variables and the debug symbol table correctly describes 
them, OpenVMS DEBUG does not currently support 
multiple active child variables. When searching a sym- 
bol's lifetime segments for one that includes the current 
PC, the first match is taken as the only match. 

Two-part Variables Support for two-part variables 
(those occupying two registers) assumes that a com- 
plete definition will occur within a single basic block. 
          int i, j, k; 
389       i = ...; 
390       j = 2; 
391       k = 3; 
392 
393       if (foo(i)) { 
394           j = 17; 
395           } 
396       else { 
397           k = 18; 
398           } 
399 
400       printf("%d, %d, %d\n", i, j, k); 
401 
402       } 

Figure 4 

$ run doct8 

OpenVMS Alpha Debug64 Version T7.2-001 
%I, language is C, module set to DOCT8 
DBG> step/into 
stepped to DOCT8\doct8\%LINE 391 
   391:    k = 3; 
DBG> examine i, j, k 
%W, entity 'i' was not allocated in memory (was optimized away) 
%W, entity 'j' does not have a value at the current PC 
%W, entity 'k' does not have a value at the current PC 

Note the difference in the message for variable i compared to the messages for variables j and k. We 
see that variable i was not allocated in memory (registers or otherwise), so there is no point in ever 
trying to examine its value again. Variables j and k, however, do not have a value "at the current PC." 
Somewhere later in the program they will have a value, but not here. 

The dialogue continues as follows: 

DBG> step 6 
stepped to DOCT8\doct8\%LINE 391 
   391:    k = 3; 
DBG> step 
stepped to DOCT8\doct8\%LINE 393 
   393:    if (foo(i)) { 
DBG> examine j, k 
%W, entity 'j' does not have a value at the current PC 
DOCT8\doct8\k: 3 
    value defined at DOCT8\doct8\%LINE 391 

Here we see that j is still undefined but k now has a value, namely 3, which was assigned at line 391. 
The source indicates that j was assigned a value at line 390, before the assignment to k, but j's assign- 
ment has yet to occur. 

Skipping ahead in the dialogue to the print statement at line 400, we see the following: 

DBG> set break %line 400 
DBG> go 
break at DOCT8\doct8\%LINE 400 
   400:    printf("%d, %d, %d\n", i, j, k); 
DBG> examine j 
DOCT8\doct8\j: 2 
    value defined at DOCT8\doct8\%LINE 390 
    value defined at DOCT8\doct8\%LINE 394 
DBG> examine k 
DOCT8\doct8\k: 18 
    value defined at DOCT8\doct8\%LINE 397+4 
    value defined at DOCT8\doct8\%LINE 391 

This portion of the message shows that more than one definition location is given for both j and k. 
Which of each pair applies depends on which path was taken in the if statement. If a variable has an 
apparently inappropriate value, this mechanism provides a means to take a closer look at those places, 
and only those places, from which that value might have come. 



Figure 5 

Dialogue Resulting from Running DOCT8 



That is, at the end of a basic block, if the second part of 
a definition is missing then the initial part is discarded 
and forgotten. 

Consider the following FORTRAN fragment: 

    COMPLEX X, Y 

    Y = X + (1.0, 0.0) 

Suppose that the last use of variable X occurs in the 
assignment to variable Y so that X and Y can be and are 
allocated in the same location, in particular, the same 
register pair. In this case, the definition of Y requires 
only one instruction, which adds 1.0 to the real part of 
the location shared by X and Y. Because there is no sec- 
ond instruction to indicate completion of the defini- 
tion, the definition will be lost by our implementation. 
Semantic Stepping 

A major problem with stepping by line through opti- 
mized code is that the apparent source program loca- 
tion "bounces" back and forth, with the same line 
often appearing again and again. In large part this 
bouncing is due to a compiler optimization called 
code scheduling, in which instructions that arise from 
the same source line are scheduled, that is, reordered 
and intermixed with other instructions, for better exe- 
cution performance. 

OpenVMS DEBUG, like most debuggers, interprets 
the STEP/LINE (step by line) command to mean that 
the program should execute until the line number 
changes. Line numbers change more frequently in 
scheduled code than in unoptimized code. 

For example, in sample programs from the SPEC95 
Benchmark Suite, the average number of instructions 
in sequence that share the same line number is typi- 
cally between 2 and 3 — and typically 50 to 70 percent 
of those sequences consist of just 1 instruction! In 
contrast, if only instruction-level scheduling is dis- 
abled, then the average number of instructions is 
between 4 and 6, with 20 to 30 percent consisting of 
one instruction. In a compilation with no optimiza- 
tion, there are 8 to 12 instructions in a sequence, with 
roughly 5 percent consisting of a single instruction. 
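
The two measures quoted above (average length of a sequence of instructions sharing a line number, and the share of sequences that are a single instruction) are easy to compute from a per-instruction line table. The sketch below shows one way to do so; `line_run_stats` is a hypothetical helper and the sample line table is invented for illustration, not taken from SPEC95.

```python
from itertools import groupby


def line_run_stats(line_numbers):
    """Given the source line number attributed to each instruction, in
    address order, return (average run length, fraction of runs that
    consist of exactly one instruction)."""
    runs = [len(list(group)) for _, group in groupby(line_numbers)]
    average = sum(runs) / len(runs)
    single_fraction = sum(1 for length in runs if length == 1) / len(runs)
    return average, single_fraction


# Invented line table for eight instructions of scheduled code: line 10
# and line 11 each appear in two separate runs, which is the "bouncing"
# that a STEP/LINE user observes.
scheduled = [10, 10, 11, 10, 12, 12, 12, 11]
print(line_run_stats(scheduled))  # (1.6, 0.6)
```

Each boundary between runs is a place where a step-by-line command would stop, so shorter runs translate directly into more stops per source line.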

A second problem with stepping by line through an 
optimized program is that, because of the behavior of 
revisiting the same line again and again, the user is 
never quite sure when the line has finished executing. 
It is unclear when an assignment actually occurs or a 
control flow decision is about to be made. 

In unoptimized code, when a user requests a break- 
point on a certain line, the user expects execution to 
stop just before that line, hence before the line is car- 
ried out. In optimized code, however, there is no well- 
defined location that is "before the line is carried out," 
because the code for that line is typically scattered 
about, intermixed, and even combined with the code 
for various other lines. It is usually possible, however, 
to identify the instruction that actually carries out the 
effect of the line. 

Semantic Event Concept 

We introduce a new kind of stepping mode called 
semantic stepping to address these problems. Semantic 
stepping allows the program to execute up to, but not 
including, an instruction that causes a semantic effect. 
Instructions that cause semantic effects are instructions 
that 

■ Assign a value to a user variable 

■ Make a control flow decision 

■ Make a routine call 



Not all such instructions are appropriate, however. 
We start with an initial set of candidate instructions 
and refine it. The following sections describe the 
heuristics that are currently in use. 

Assignment The candidates for assignment events 
are the instructions that assign a value to a variable (or 
to one of its split children). The second instruction in 
an assignment to a two-part variable is excluded. 
Stopping between the two assignments is inadvisable 
because at that point the variable no longer has the 
complete old state and does not yet have the complete 
new state. 

Branches There are two kinds of branch: uncondi- 
tional and conditional. An unconditional branch may 
have a known destination or an unknown destination. 
Unconditional branches with known destinations 
most often arise as part of some larger semantic con- 
struct such as an if-then-else or a loop. For example, 
code for an if-then-else construct generally has an 
implicit join that occurs at the end of the statement. 
The join takes the form of a jump from the end of one 
alternative to the location just past the last instruction 
of the other (which has no explicit jump and falls 
through into the next statement). This jump turns the 
inherently symmetric join at the source level into an 
asymmetric construction at the code stream level. 

Unconditional jumps almost never define interest- 
ing semantic events — some related instruction usually 
provides a more useful event point, such as the termi- 
nation test in the case of a loop. One exception is a 
simple goto statement, but these are very often opti- 
mized away in any case. Consequently, unconditional 
branches with known destinations are not treated as 
semantic events. 

Unconditional branches with unknown destina- 
tions are really conditional branches: they arise from 
constructs such as a C switch statement implemented 
as a table dispatch or a FORTRAN assigned GOTO statement.
These branches definitely are interesting points
at which to allow user interaction before the new
direction is taken. Thus, the compiler retains unconditional
branches with unknown destinations as semantic events.

Similarly, in general, conditional branches to known 
destinations are important semantic event points. Often 
more than one branch instruction is generated for a sin- 
gle high-level source construct, for example, a decision 
tree of tests and branches used to implement a small 
C switch statement. In this case, only the first in the 
execution sequence is used as the semantic event point. 

Calls Most calls are visible to a user and constitute 
semantically interesting events. However, calls to 
some run-time library routines are usually not interesting
because these calls are perceived to be merely software
implementations of primitive operations, such as
integer division in the case of the Alpha architecture. 
GEM internally marks calls to all its own run-time sup- 
port routines as not semantically interesting. Compiler 
front ends accomplish this where appropriate for their 
own set of run-time support routines by setting a flag 
on the associated entry symbol node. 

Compiler Processing 

In most cases, the compiler can identify semantic event 
locations by simple predicates on each instruction. 
The exceptions are 

■ The second of the two instructions that assign val- 
ues to a two-part variable is identified during split 
lifetime analysis. 

■ Conditional branches that are part of a larger con- 
struct are identified during a simple pass over the 
flow graph. 
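The heuristics described under Semantic Event Concept can be summarized as a single predicate. The sketch below is only an illustration of those rules, not GEM code: the `Instr` record and all of its field names are assumptions introduced here, and the two flags that the text says require extra analysis (`second_of_pair` and `later_branch_of_construct`) are simply taken as given.

```python
from dataclasses import dataclass


@dataclass
class Instr:
    # Hypothetical instruction summary; every field name is an assumption.
    kind: str                                # "assign", "branch", "call", or "other"
    conditional: bool = False                # branches only
    known_target: bool = True                # branches only
    second_of_pair: bool = False             # second store to a two-part variable
    later_branch_of_construct: bool = False  # not the first branch of its construct
    runtime_support_call: bool = False       # call flagged as not interesting


def is_semantic_event(instr):
    """Apply the assignment, branch, and call heuristics to one instruction."""
    if instr.kind == "assign":
        # Never stop between the two halves of a two-part assignment.
        return not instr.second_of_pair
    if instr.kind == "branch":
        if not instr.conditional:
            # Known-destination jumps are dropped; table dispatches are kept.
            return not instr.known_target
        # Only the first branch of a multi-branch construct is an event.
        return not instr.later_branch_of_construct
    if instr.kind == "call":
        return not instr.runtime_support_call
    return False
```

For example, `is_semantic_event(Instr("branch", conditional=False, known_target=False))` is true (a table dispatch), whereas the same call with `known_target=True` is false.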

Object Module Representation 

The object module debugging semantic event repre- 
sentation contains a sequence of address and event 
kind pairs, in ascending address order. 
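As a rough illustration of this representation (not the actual OpenVMS object module format), the sketch below models the table as a list of address and event kind pairs kept in ascending address order. The `EventKind` names, the sample addresses, and the `next_event_after` helper are assumptions made for the example.

```python
from bisect import bisect_right
from enum import Enum


class EventKind(Enum):
    # Hypothetical names for the three kinds of semantic effect.
    ASSIGN = "assign"
    BRANCH = "branch"
    CALL = "call"


# Hypothetical table: (address, kind) pairs in ascending address order.
EVENTS = [
    (0x20010, EventKind.ASSIGN),
    (0x20024, EventKind.BRANCH),
    (0x20040, EventKind.CALL),
]


def next_event_after(events, pc):
    """Return the first (address, kind) pair whose address is above pc, or None."""
    addresses = [addr for addr, _ in events]
    i = bisect_right(addresses, pc)  # the ascending order makes a binary search valid
    return events[i] if i < len(events) else None
```

Keeping the pairs sorted by address is what makes the "next higher event point" query a simple binary search.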

Debugger Processing 

Semantic stepping in the debugger involves a new 
algorithm for determining the range of instructions to 
execute. This algorithm is built on a debugger primitive
mechanism that supports full-speed execution of
user instructions within a given range of addresses but
traps any transfer out of that range, whether by reaching
the end or by executing any kind of branch or call
instruction.

Semantic stepping works as follows. Starting with 
the current program counter address, OpenVMS 
DEBUG finds the next higher address that is a seman- 
tic event point; this is the target event point. 
OpenVMS DEBUG executes instructions in the 
address range that starts at the address of the current 
instruction and ends at the instruction that precedes 
the target event point. The range execution terminates 
in the following two cases: 

1. If the next instruction to execute is the target event
point, then execution reached the end of the target
range and the step operation is complete.

2. If the next instruction to execute is not the target 
event point, then the next address becomes the cur- 
rent address and the process repeats (silently). 

Note that, unlike the algorithm that determines the
range for stepping by line, the new algorithm does not
require an explicit test for the kind of instruction, in
particular, to test if it is a kind of branch. The compiler
already marks branches with the semantic event
attribute, if appropriate. Also unlike the traditional
stepping-by-line algorithm, the new algorithm does
not consider the source line number.
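The stepping loop can be sketched on a toy instruction stream. This is a simplified model under stated assumptions, not OpenVMS DEBUG code: `run_in_range` stands in for the range-execution primitive, the `SUCCESSOR` map (each address mapped to the address executed next) replaces real machine execution, and the sketch follows the two termination cases literally.

```python
from bisect import bisect_right

# Toy program (an assumption for illustration): address -> next address executed.
# The instruction at 0x04 is an unconditional jump to 0x10.
SUCCESSOR = {0x00: 0x04, 0x04: 0x10, 0x08: 0x0C, 0x0C: 0x10,
             0x10: 0x14, 0x14: 0x18, 0x18: 0x1C}
EVENT_ADDRS = [0x08, 0x14]  # semantic event points, ascending


def run_in_range(successor, pc, lo, hi):
    """Stand-in for the debugger primitive: execute while lo <= pc <= hi,
    then report the address of the next instruction to execute."""
    while lo <= pc <= hi:
        pc = successor[pc]
    return pc


def semantic_step(successor, event_addrs, pc):
    """Advance from pc to the next semantic event point that is actually reached."""
    addresses = sorted(successor)
    while True:
        i = bisect_right(event_addrs, pc)
        if i == len(event_addrs):
            return None  # no further event point in this toy program
        target = event_addrs[i]
        # Range: the current instruction through the one preceding the target.
        hi = addresses[addresses.index(target) - 1]
        pc = run_in_range(successor, pc, pc, hi)
        if pc == target:
            return pc  # case 1: the step is complete
        # case 2: control left the range elsewhere; repeat silently from there
```

Starting at 0x00, the first target is 0x08, but the jump at 0x04 leaves the range and lands at 0x10; the loop then picks 0x14 as the new target and stops there. No instruction kind is tested and no line number is consulted, which matches the observation above.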

Visible Effect 

With semantic stepping, a user's perception of forward 
progress through the code is no longer dominated by 
the side effects of code scheduling, that is, stopping 
every few instructions regardless of what is happening. 
Rather, this perception is much more closely related to 
the actual semantic behavior, that is, stopping every 
statement or so, independent of how many instruc- 
tions from disparate statements may have executed. 

Note that jumping forward and backward in the 
source may still occur, for example, when code motions 
have changed the order in which semantic actions are 
performed. Nothing about semantic event handling 
attempts to hide such reordering. 

Inlining 

Procedure call inlining can be confusing when using a 
traditional debugger. For example, if routine INNER 
is inlined into routine CALLER and the current point 
of execution is within INNER, should the debugger 
report the current source location as at a location in 
the caller routine or in the called routine? Neither is 
completely satisfactory by itself. If the current line is 
reported as at the location within INNER, then that 
information will appear to conflict with information 
from a call stack traceback, which would not show 
routine INNER. If the current line is reported as 
though in CALLER, then relevant location informa- 
tion from the callee will be obscured or suppressed. 
Worse yet, in the case of nested inlining, potentially 
crucial information about the intermediate call path 
may not be available in any form. 

The problem of dealing with inlining was solved 
long ago by Zellweger 1 — at least the topic has not 
been treated again since. Zellweger's approach adds 
additional information to an otherwise traditional table 
that maps from instruction addresses to the corre- 
sponding source line numbers. Our approach is differ- 
ent: it includes additional information in the scope 
description of the debugging symbol table. 

A key underpinning for inline support is the ability 
to accurately describe scopes that consist of multiple 
discontiguous ranges of instruction addresses, rather 
than the traditional single range. This capability is
quite independent of inlining as such. However, 
because code from an inlined routine is freely sched- 
uled with other code from the calling context, dealing 
accurately with the resulting disjoint scopes is an 
essential building block for effective support. 
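To make the idea of a discontiguous scope concrete, the following minimal sketch describes each scope by a list of address ranges and finds the innermost scope that contains a given program counter. The `Scope` class, the half-open range convention, and the sample addresses are assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class Scope:
    name: str
    ranges: list  # (start, end) address pairs; end is exclusive (an assumption)
    children: list = field(default_factory=list)

    def contains(self, pc):
        return any(start <= pc < end for start, end in self.ranges)


def innermost_scope(scope, pc):
    """Return the deepest scope whose ranges contain pc, or None."""
    if not scope.contains(pc):
        return None
    for child in scope.children:
        found = innermost_scope(child, pc)
        if found is not None:
            return found
    return scope


# Code from INNER was scheduled into two separate pieces inside CALLER.
inner = Scope("INNER", [(0x110, 0x118), (0x130, 0x13C)])
caller = Scope("CALLER", [(0x100, 0x180)], [inner])
```

With a single-range description, address 0x120 would wrongly appear to lie inside INNER; with the list of ranges, `innermost_scope(caller, 0x120)` yields CALLER while 0x114 and 0x134 both yield INNER.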

Goals for Debugger Support

Our overall goal is to support debugging of inlined
code with expected behavior, that is, as though the 
inlining has not occurred. More specifically, we seek to 
provide the ability to 

■ Report the source location corresponding to the 
current position in the code 

■ Display parameters and local variables of an inlined
routine

■ Show a traceback that includes call frames corresponding
to inlined routines

■ Set a breakpoint at a given routine entry 

■ Set a breakpoint at a given line number (from 
within an inlined routine) 

■ Call an inlined routine 

We have achieved these goals to a substantial extent.

GEM Locators

Before describing the mechanisms for inlining, we
introduce the GEM notion of a locator. A locator
describes a place in the source text. The simplest kinds
of locator describe a point in the source, including the
name of the file, the line within that file, and the column
within that line; they even describe the point at
which that file was included by another file (as for a C
or C++ #include directive), if applicable.

A crucial characteristic of locators is that they are all 
of a uniform fixed size that is no larger than an integer 
or pointer. (How this is achieved is beyond the scope 
of this paper.) In particular, locators are small enough 
that every tuple node in the intermediate language 
(IL) and every code cell in the generated code stream
contains one. Moreover, GEM as a whole is quite
meticulous about maintaining and propagating high- 
quality locator information throughout its optimiza- 
tion and code generation. 

An additional kind of locator was introduced for
inlining support. This inline locator encodes a pair 
that consists of a locator (which may also be an inline 
locator) and the address of an associated scope node in 
the GEM symbol table. 

Compiler Processing 

Debugging optimized code support for inlining gen- 
erally builds on and is a minor enhancement of the 
GEM inlining mechanism. Inlining occurs during an 
early part of the GEM optimizer phase. 

Inlining is implemented in GEM as follows: 

■ Within the scope that contains the call site, an inline
scope block is introduced. This scope represents the
result of the inlining operation. It is populated with
local variable declarations that correspond one-to-one
with the formal parameters of the inlined routine.

■ The actual arguments of the call are transformed
into assignments that initialize the values of the surrogate
parameter variables.

■ The inline scope is also made to contain a body
scope, which is a copy of the body of the inlined
routine, including a copy of its local variables.

■ The original call is replaced with a jump to a copy of 
the IL for the body of the routine, in which refer- 
ences to declarations or parameters of the routine 
are replaced with references to their corresponding 
copied declarations. In addition, returns from the 
routine are replaced with jumps back to the tuple 
following the original call. 

■ Similar "boundary adjustments" are made to deal 
with function results, output parameters, choice of 
entry point (when there is more than one, as might 
occur for FORTRAN alternate entry statements), 
etc. (The bookkeeping is a bit intricate, but it is 
conceptually straightforward.) 

The calling routine, which now incorporates a copy 
of the inlined routine, is then further processed as a 
normal (though larger) routine. 

Inlining Annotations for Debugging The main changes 
introduced for debugging optimized code support are 
as follows. 

■ The newly created inline scope block is annotated 
with additional information, namely, 

- A pointer to the routine declaration being inlined. 

- The locator from the call that is replaced. In a simple
call with no arguments, there may be nothing
left in the IL from the original call after inlining is
completed; this locator captures the original call
location for possible later use, for example, as a
supplement to the information that maps instruction
addresses to source line numbers.

■ As the code list of the original inlined routine is 
copied, each locator from the original is replaced by 
a new inline locator that records 

- The original locator. 

- The newly created inline scope into which it is 
being copied. 

As a result of these steps, every inlined instruction can 
be related back to the scope into which it was inlined 
and hence to the routine from which it was inlined, 
regardless of how it may be modified or moved as a 
result of subsequent optimization. 
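One way to picture how these annotations relate an instruction back through nested inlinings is the following sketch. It is a simplified model, not the GEM data structures: the class names, the `virtual_frames` helper, and the assumption that each call-site locator is a plain locator are all introduced here for illustration. The routine names and line numbers are borrowed from the sample program in Figure 6.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Locator:
    file: str
    line: int


@dataclass(frozen=True)
class InlineScope:
    routine: str        # the routine declaration that was inlined
    call_site: Locator  # locator taken from the call that was replaced


@dataclass(frozen=True)
class InlineLocator:
    original: Locator | InlineLocator  # may itself be an inline locator
    scope: InlineScope


def virtual_frames(locator, real_routine):
    """List (routine, line) pairs, innermost inlined routine first."""
    scopes = []
    while isinstance(locator, InlineLocator):  # peel wrappers from the outside in
        scopes.append(locator.scope)
        locator = locator.original
    frames = []
    line = locator.line  # plain source position of the instruction itself
    for scope in reversed(scopes):
        frames.append((scope.routine, line))
        line = scope.call_site.line  # the caller is positioned at the call
    frames.append((real_routine, line))
    return frames


# B (line 15) was inlined into A at line 9; A was then inlined into MAIN at line 4.
scope_b = InlineScope("B", Locator("DOCFJ-INLINE-2", 9))
scope_a = InlineScope("A", Locator("DOCFJ-INLINE-2", 4))
loc = InlineLocator(InlineLocator(Locator("DOCFJ-INLINE-2", 15), scope_b), scope_a)
```

Here `virtual_frames(loc, "MAIN")` produces frames for B at line 15, A at line 9, and MAIN at line 4, the same routine and line sequence that the traceback in Figure 7 reports.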

Note that these additional steps are an exception to 
the general assertion that debugging optimized code 
support occurs after code generation and just prior to 
object code emission. These steps in no way influence 
the generated code — only the debugging symbol table 
that is output. 

Prologue and Epilogue Sets The prologue of a routine
generally consists of those instructions at the
beginning of the routine that establish the routine 
stack frame (for example, allocate stack and save the 
return address and other preserved registers) and that 
must be executed before a debugger can usefully inter- 
pret the state of the routine. For this reason, setting a 
breakpoint at the beginning of a routine is usually 
(transparently) implemented by setting a breakpoint 
after the prologue of that routine is completed. 

Conversely, the epilogue of a routine consists of 
those instructions at the end of a routine that tear 
down the stack frame, reestablish the caller's context, 
and make the return value, if any, available to the 
caller. For this reason, stopping at the end of a routine 
is usually (transparently) implemented by setting a 
breakpoint before the epilogue of that routine begins. 

One benefit of inlining is that most prologue and 
epilogue code is avoided; however, there may still be 
some scope management associated with scope entry 
and exit. Also, some programming language-related 
environment management associated with the scope 
may exist and should be treated in a manner analogous 
to traditional prologue and epilogue code. The prob- 
lem is how to identify it, because most of the tradi- 
tional compiler code generation hooks do not apply. 

The model we chose takes advantage of the seman- 
tic event information that we describe in the section 
Semantic Stepping. In particular, we define the first 
semantic event that can be executed within the inlined 
routine to be the end of the prologue. For reasons dis- 
cussed later, we define the last instruction (not the last 
semantic event) of the inlined code as the beginning of 
the epilogue. As a result of unrelated optimization 
effects, each of these may turn out to be a set of 
instructions. Determination of inline prologue and 
epilogue sets occurs after split lifetime and semantic 
event determination is completed so that the results of 
those analyses can be used. 

To determine the set of prologue instructions, for each 
inline instance, GEM starts with every possible entry 
block and scans forward through the flow graph looking
for the first semantic event instruction that can be reached 
from that entry. The set of such instructions constitutes 
the prologue set for that instance of the inlined routine. 

This is a spanning walk forward from the routine 
entry (or entries) that stops either when a block is 
found to contain an instruction from the given inline 
instance or when the block has already been encoun- 
tered (each block is considered at most once). Note 
that there may be execution paths that include one or 
more instructions from an inlining, none of which is a 
semantic event instruction. 
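The forward walk can be sketched as a breadth-first traversal of a small flow graph. This is one literal reading of the description, under stated assumptions: the `Instr` and `Block` records and the `prologue_set` function are hypothetical, blocks are identified by list index, and a block that contains instructions of the inline instance but no semantic event among them contributes nothing.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Instr:
    addr: int
    inline_instance: object = None  # which inlining produced it (None = caller code)
    is_event: bool = False          # marked as a semantic event


@dataclass
class Block:
    instrs: list
    succs: list = field(default_factory=list)  # indices of successor blocks


def prologue_set(blocks, entries, instance):
    """Collect the first semantic event of `instance` reachable from each entry."""
    seen, result = set(), set()
    work = deque(entries)
    while work:
        b = work.popleft()
        if b in seen:
            continue  # each block is considered at most once
        seen.add(b)
        mine = [i for i in blocks[b].instrs if i.inline_instance == instance]
        if mine:
            first = next((i for i in mine if i.is_event), None)
            if first is not None:
                result.add(first.addr)
            continue  # stop: do not walk past the inline instance
        work.extend(blocks[b].succs)
    return result


# Entry block 0 branches to blocks 1 and 2, which both fall into block 3.
BLOCKS = [
    Block([Instr(0x00)], [1, 2]),
    Block([Instr(0x10, "A#1"), Instr(0x14, "A#1", True)], [3]),
    Block([Instr(0x20, "A#1", True)], [3]),
    Block([Instr(0x30, "A#1", True)]),
]
```

In this example the prologue set is the pair of addresses 0x14 and 0x20: one per path into the inlined code, which shows why the end of the prologue may be a set of instructions rather than a single one. The epilogue walk would run the same traversal backward from the exit blocks, without the semantic event test.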

The set of epilogue instructions is determined using
an inverse of the prologue algorithm. The process
starts with each possible exit block and scans backward
through the flow graph looking for the last instruction
(that is, the instruction closest to the routine exit) of
an inline instance that can reach an exit.

Note that prologue and epilogue sets are not strictly 
symmetric: prologue sets consist of only instructions that 
are also semantic events, whereas epilogue sets include 
instructions that may or may not be semantic events. 

Object Module Representation 

To describe any inlining that may have occurred dur- 
ing compilation, we include three new kinds of infor- 
mation in the debugging symbol table. 

If the instructions contained in a scope do not form a
single contiguous range, then the description of the
scope is augmented with a discontiguous range description.
This description consists of a sequence of ranges.
(The scope itself indicates the traditional approximate
range description to provide backward compatibility
with older versions of OpenVMS DEBUG.) This augmented
description applies to all scopes, whether or not
they are the result of inlining.

For a scope that results from inlining a call, the 
description of the scope is augmented with a record 
that refers to the routine that was inlined as well as the 
line number of the call. Each scope also contains two 
entries that consist of the sequence of prologue and 
epilogue addresses, respectively. 

Backward compatibility is fully maintained. An older 
version of OpenVMS DEBUG that does not recognize 
the new kinds of information will simply ignore it. 

Debugger Processing 

As the debugger reads the debugging symbol table of 
a module, it constructs a list of the inlined instances for 
each routine. This process makes it possible to find all 
instances of a given routine. Note, however, that if every 
call of the routine is expanded inline and the routine 
cannot otherwise be called from outside that module, 
then GEM does not create a noninlined (closed-form) 
version of the routine. 

Report Source Location It is a simple process to report 
the source location that corresponds to the current code 
address. When stopped inside the code resulting from 
an inlined routine, the program counter maps directly 
to a source line within the inlined routine. 

Display Parameters and Local Variables As is the case 
for a noninlined routine, the scope description for an 
inlined routine contains copies of the parameters and 
the local variables. No special processing is required to 
perform name binding for such entities. 

Include Inlined Calls in Traceback The debugger pre- 
sents inlined routines as if they are real routine calls. A 
stack frame whose current code address corresponds 
to an inlined routine instance is described with two or
more virtual stack frames: one or more for the inlined
instance(s) and one for the ultimate caller. (An example
is shown later in Figure 7.)

Set Breakpoints at Inlined Routine Instances The
strategy for setting breakpoints at inlined routines is
based on a generalization of processing that previously
existed for C++ member functions. Compilation of
C++ modules can result in code for a given member
function being compiled every time the class or template
definition that contains the member function is
compiled. We refer to all these compilations as clones.
(It is not necessary to distinguish which of them is the
original.) In our generalization, an inlined routine call
instance is treated like a clone. To set a breakpoint at a
routine, the debugger sets breakpoints at all the end-of-prologue
addresses of every clone of the given routine
in all the currently active modules.

Set Breakpoints at Inlined Line Number Instances The
strategy for setting breakpoints on line numbers shares
some features of setting breakpoints on routines, with 
additional complications. Compiler-reported line num- 
bers on OpenVMS systems are unique across all the 
files included in a compilation. It follows that the same 
file included in more than one compilation may have 
different associated line numbers. 

To set a breakpoint at a particular line number, 
that line number needs to be first normalized relative 
to the containing file. This normalized line number 
value is then compared to normalized line numbers 
for that same file that are included in other compila- 
tions. (If different versions of the same named file 
occur in different compilations, the versions are 
treated as unrelated.) The original line number is 
converted into the set of address ranges that corre- 
spond to it in all modules, taking into account inlin- 
ing and cloning. 

Call a Routine That Is Inlined If the compiler creates a
closed-form version of a routine, then the debugger
can call that routine independent of whether there 
may also be inlined instances of the routine. If no such 
version of the routine exists, then the debugger cannot 
call the routine. 

Usage Example 

Inlining support has many aspects, but we will illustrate
only one: a call traceback that includes inlined
calls. Consider the sample program shown in Figure 6.
This program has four routines: three are combined in
a single file (enabling the GEM FORTRAN compiler
to perform inline optimizations), and the last is in a
separate file. To help correlate the lines of code in
these two files with those in Figure 7, we added line
numbers to the left of the code. Note that these numbers
are not part of the program.

Line
        +++ File DOCFJ-INLINE-2
  1     C       Main routine
  2     C
  3             INTEGER A, C
  4             TYPE *, A(3, C(10))
  5             END
  6     C
  7             FUNCTION A(I, L)
  8             INTEGER A, B
  9             A = B(5, I) + 2*L
 10
 11             END
 12     C
 13             FUNCTION B(J, K)
 14             INTEGER B, C
 15             B = C(9) + J + K
 16             END

        +++ File DOCFJ-INLINE-
  1     C
  2             FUNCTION C(I)
  3             INTEGER C
  4             C = 2*I
  5             RETURN
  6

Figure 6
Program to Illustrate Inlining Support

If we compile, link, and run this program using the 
OpenVMS DEBUG option, we can step to a place in 
routine B that is just before the call to routine C and 
then request a traceback of the call stack. This dialogue 
is shown in Figure 7. 

Figure 7 shows that pseudo stack frames are reported
for routines A and B, even though the call of routine B
has been inlined into routine A and the call of routine A
has been inlined into the main program. The main difference
from a real stack frame is the extra line that
reports that the "above routine is inlined."

Limitations 

In a real stack frame, it is possible to examine (and 
even deposit into) the real machine registers, rather 
than examine the variables that happen to be allocated 
in machine registers. In an inlined stack frame, this 
operation is not well defined and consequently not 
supported. In a noninlined stack frame, these operations
are still allowed.

An attractive feature that would round out the 
expected behavior of inlined routine calls would be to 
support stepping into or over the inlined call in the 
same way that is possible for noninlined calls. This fea- 
ture is not currently supported — execution always 
steps into the call. 



Digital Technical Journal 



Vol. 10 No. 1 1998 



GEMEVN$ run DOCFJ-INLINE-2
         OpenVMS Alpha Debug64 Version 11. 2-001
%I, Language: FORTRAN, Module: DOCFJ-INLINE-2$MAIN
DBG> step /semantic
stepped to DOCFJ-INLINE-2$MAIN\A\B\%LINE 15^8
    15:         B = C(9) - J + K
DBG> show calls
 module name           routine name          line  rel PC            abs PC
*DOCFJ-INLINE-2$MAIN
                       B                       15  000000000000001C  000000000002006C
     above routine is inlined
*DOCFJ-INLINE-2$MAIN
                       A                        9  0000000000000004  0000000000020054
     above routine is inlined
*DOCFJ-INLINE-2$MAIN
                       DOCFJ-INLINE-2$MAIN
                                                4  0000000000000038  0000000000020038
                                                   0000000000000000  FFFFFFFF8590716C

Figure 7
OpenVMS DEBUG Dialogue to Illustrate Inlining Support

Performance and Resource Usage 

We gathered a number of statistics to determine typi- 
cal resource requirements for using the enhanced 
debugging optimized code capability compared to the 
traditional practice of debugging unoptimized code. A 
short summary of the findings follows.

■ All metrics tend to show wide variance from pro- 
gram to program, especially small ones. 

■ Generating traditional debugging symbol information 
increases the size of object modules typically by 50 to 
100 percent on the OpenVMS system. Executable 
image sizes show similar but smaller size increases. 

■ Generating enhanced symbol table information 
adds about 2 to 5 percent to the typical compilation 
time, although higher percentages have been seen 
for unusually large programs. 

■ Generating enhanced symbol table information 
uses significant memory during compilation but 
does not affect the peak memory requirement of a 
compilation. 

■ Generating enhanced symbol table information 
further increases the size of the symbol table infor- 
mation compared to that for an unoptimized com- 
pilation. On the OpenVMS system, this adds 100 to 
200 percent to the debugging symbol table of 
object modules and perhaps 50 to 100 percent for 
executable images. 

■ Compiling with full optimization reduces the 
resulting image size. Total net image size increases 
typically by 50 to 80 percent. 

A more detailed presentation of findings follows. 
Tables 1 through 3 present data collected using pro- 
duction OpenVMS Alpha native compilers built in 
December 1996. In developing these results, we used 
five combinations of compilation options as follows: 



S1: no optimization (noopt), no debugging information
(nodebug, nodbgopt)

S2: no optimization (noopt), normal debugging
information (debug, nodbgopt)

S4: full (default) optimization (opt), no debugging
information (nodebug, nodbgopt)

S5: full optimization (opt), normal debugging
information only (debug, nodbgopt)

S8: full optimization (opt), enhanced debugging
information (debug, dbgopt)

Note that the option combination numbering sys- 
tem is historical; we retained the system to help keep 
data logs consistent over time. 

Compile-time Speed 

The incremental compile-time cost of creating enhanced 
symbol table information is presented in Table 1 for a 
sampling of BLISS, C, and FORTRAN modules. The 
data in this table can be summarized as follows: 

■ Traditional debugging (column 1) increases the 
total compilation time by about 1 percent. 

■ Enhanced debugging (column 2) increases the 
compilation time by about 4 percent. The largest 
component of that time, approximately 3 percent, 
is attributed to the flow analysis involved in han- 
dling split lifetime variables (column 3). 

■ Debugging tends to increase as a percentage of 
time in larger modules, which suggests that pro- 
cessing time is slightly nonlinear in program size; 
however, this increase does not seem to be excessive 
even in very large modules. 

Compile-time Space 

The compile-time memory usage during the creation of 
enhanced symbol information is presented in Table 2. 

Table 1
Percent of Compilation Time Used to Create/Output Debugging Information

                S2 (noopt, debug,   S8 (opt, debug,   (Split Lifetime
Module          nodbgopt)           dbgopt)           Analysis Only)

BLISS CODE
GEM_AN          0.3%                1.1%              0.7%
GEM_DB          0.9                 1.8               1.3
GEM_DF          0.8                 5.2               4.4
GEM_FB          0.7                 3.5               2.7
GEM_IL_PEEP     0.6                 14.4              13.9

C CODE
C_METRIC        1.5                 5.2               4.1
GRAM            0.5                 2.9               2.2
INTERP          1.2                 4.5               3.2

FORTRAN CODE
MATRIX300X      nm                  nm                nm
NAGL            1.4                 13.0              11.9
SPICE_V07       3.0                 6.4               4.7
WAVEX           2.5                 6.3               4.8

Average         1.2%                4.3%              3.2%
Typical range   (0.5%-1.5%)         (3.0%-7.0%)       (2.0%-5.0%)

Note: "nm" represents "not meaningful," that is, too small to be accurately measured.





Table 2
Key Dynamic Memory Zone Sizes during BLISS GEM Compilations

               Peak     SYMBOL   EIL      CODE     OM      |  %      %      %
File           Total    ZONE     ZONE     ZONE     ZONE    |  Peak   Larg   EIL

BLISS CODE
GEM_AN          2,507      130       85      184       15  |   6%     8%    18%
GEM_DF         11,305      836    1,672    2,056    1,180  |  10     57     71
GEM_FB          4,694      316      522      457      304  |   6     58     58
GEM_IL_PEEP    40,419    1,606   17,666    4,411   14,143  |  34     80     80

C CODE
C_METRIC        7,381    1,115      494    2,563      167  |   2      6     34
GRAM            3,031       82      815      211      267  |   9     33     33
INTERP          3,563      354      308      688      131  |   4     20     43

FORTRAN CODE
MATRIX300X        934      143      227      101       58  |   6     26     26
NAGL            6,267    1,520    1,791    1,742       68  |  11     38     38
SPICE_V07       6,234    1,051    3,256      885      459  |   7     14     14
WAVEX          12,812    4,676    3,119    3,482       68  |   5     14     22

Average                                                    |   9%    32%    40%

Note: All numbers to the left of the vertical bar are thousands of bytes, not multiples of 1,024.

Column Key:

Column        Description
Peak Total    The peak dynamic memory allocated in all zones during the compilation
SYMBOL ZONE   The zone that holds the GEM symbol table
EIL ZONE      The zone that holds the largest EIL ZONE (used for the expanded intermediate representation)
CODE ZONE     The zone that holds the GEM generated code list
OM ZONE       The zone that holds split lifetime and other working data
% Peak        The OM ZONE size as a percentage of the Peak Total size
% Larg        The OM ZONE size as a percentage of the largest single zone in the compilation
% EIL         The OM ZONE size as a percentage of the EIL ZONE size

The following is a summary of the data, where OM
ZONE refers to the temporary working virtual mem- 
ory zone used for split lifetime analysis: 

■ The OM ZONE size averages about 10 percent of 
the peak compilation size. 

■ The OM ZONE size is one-quarter to one-half of the
EIL ZONE size. (The latter is well known for typically
being the largest zone in a GEM compilation.)

■ Since the OM ZONE is created and destroyed after all
EIL ZONEs are destroyed, the OM ZONE does not
contribute to establishing the peak total size.

Object Module Size 

The increased size of enhanced symbol table informa- 
tion for both object files and executable image files is 
shown in Table 3. 

In Table 3, the application or group of modules is identified
in the first column. The columns labeled S1, S2, etc.
give the resulting size for the combination of compilation
options described earlier. Object module and executable
image data is presented in successive rows.

Three ratios of particular interest are computed.

S2/S1: This ratio shows the object or image size
with traditional debugging information compared
to a base compilation without any debugging information.
This ratio indicates the additional cost, in
terms of increased object and image file size, associated
with doing traditional symbolic debugging.

(S8-S5)/(S2-S1): This ratio shows the increase in
debugging symbol table size (exclusive of base object,
image text, etc.) due to the inclusion of enhanced information
compared to the traditional symbol table size.

S8/S2: This ratio shows the object or image size
with enhanced debugging information with optimization
compared to the traditional debugging
size without optimization.

The last ratio, S8/S2, is especially interesting because 
it combines two effects: (1) the reduction in size as a 
result of compiler optimization, and (2) the increase in 
size because of the larger debugging symbol table needed
to describe the result of the optimization. The result- 
ing net increase is reasonably modest. 
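As a worked check of how the three ratios are formed, the following sketch recomputes them for the GEM_*.OBJ and GEM_*.EXE rows of Table 3. The function name `size_ratios` and the dictionary layout are illustrative choices made here; the block counts are those listed in Table 3.

```python
def size_ratios(s1, s2, s5, s8):
    """Return (S2/S1, (S8-S5)/(S2-S1), S8/S2) for one row of file sizes."""
    return (s2 / s1, (s8 - s5) / (s2 - s1), s8 / s2)


# File sizes in blocks, copied from the GEM_* rows of Table 3.
ROWS = {
    "GEM_*.OBJ": dict(s1=31477, s2=51069, s5=47031, s8=68728),
    "GEM_*.EXE": dict(s1=12160, s2=29543, s5=27755, s8=32288),
}

for name, sizes in ROWS.items():
    # Two decimal places, as in the ratio columns of the table.
    print(name, " ".join(f"{r:.2f}" for r in size_ratios(**sizes)))
```

For these two rows the computed values are 1.62, 1.11, 1.35 and 2.43, 0.26, 1.09, which agree with the ratio columns of Table 3.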

Table 3
Object/Executable (.OBJ/.EXE) File Sizes (in Number of Blocks) for Various OpenVMS Components

                 S1         S2                  S4         S5         S8         (S8-S5)/
                 noopt      noopt               opt        opt        opt        (S2-S1)
                 nodebug    debug      S2/S1    nodebug    debug      debug      Ratio      S8/S2
File             nodbgopt   nodbgopt   Ratio    nodbgopt   nodbgopt   dbgopt                Ratio

BLISS CODE
GEM_*.OBJ        31,477     51,069     1.62     27,483     47,031     68,728     1.11       1.35
GEM_*.EXE        12,160     29,543     2.43     10,373     27,755     32,288     0.26       1.09

C CODE
C_METRIC.OBJ        436        653     1.50        478        733      1,680     4.36       2.57
C_METRIC.EXE        250        348     1.39        250        385        581     2.00       1.67
GRAM.OBJ            102        120     1.19        100        117        224     5.94       1.87
GRAM.EXE             60         70     1.17         58         69         91     2.20       1.30
INTERP.OBJ          140        207     1.48        134        205        450     3.66       2.17
INTERP.EXE           80        113     1.41         75        113        167     1.64       1.47

FORTRAN CODE
MATRIX300X.OBJ       20         34     1.70         16         29         71     3.00       2.08
MATRIX300X.EXE       19         29     1.53         15         25         34     0.90       1.17
NAGL.OBJ             42         63     1.51        288        509      1,178     3.11       1.84
NAGL.EXE            289        388     1.34        187        333        469     1.37       1.21
SPICE.OBJ         1,652      3,117     1.89      1,073      2,571      4,916     1.60       1.58
SPICE.EXE         1,031      1,660     1.61        549      1,318      1,803     0.77       1.09
WAVEX.OBJ           555      1,639     2.95        393      1,556      2,949     1.29       1.80
WAVEX.EXE           634      1,190     1.88        490      1,167      1,437     0.49       1.21

Summary and Conclusions

There exists a small but significant literature regarding
the debugging of optimized code, yet very few debuggers
take advantage of what is known. In this paper we
describe the new capabilities for debugging optimized
code that are now supported in the GEM compiler system
and the OpenVMS DEBUG component of the
OpenVMS Alpha operating system. These capabilities
deal with split lifetime variables and currency determination,
semantic stepping, and procedure inlining. For
each case, we describe the problem addressed and then
present an overview of GEM compiler and OpenVMS
DEBUG processing and the object module representation
that mediates between them. All but the inlining
support are included in OpenVMS DEBUG V7.0
and in GEM-based compilers for Alpha systems that
have been shipping since 1996. The inlining support is
currently in field test. Work is under way to provide
similar capabilities in the ladebug debugger 18 component
of the DIGITAL UNIX operating system.

There are and will always be more opportunities and 
new challenges to improve the ability to debug opti- 
mized code. Perhaps the biggest problem of all is to fig- 
ure out where best to focus future attention. It is easy to 
see how the capabilities described in this paper provide 
major benefits. We find it much harder to see what capa- 
bility could provide the next major increment in debug- 
ging effectiveness when working with optimized code. 

William M. McKeeman 

Differential Testing 
for Software 



Differential testing, a form of random testing, 
is a component of a mature testing technology 
for large software systems. It complements 
regression testing based on commercial test 
suites and tests locally developed during prod- 
uct development and deployment. Differential 
testing requires that two or more comparable 
systems be available to the tester. These sys- 
tems are presented with an exhaustive series 
of mechanically generated test cases. If (we 
might say when) the results differ or one of 
the systems loops indefinitely or crashes, the 
tester has a candidate for a bug-exposing test. 
Implementing differential testing is an interest- 
ing technical problem. Getting it into use is an 
even more interesting social challenge. This 
paper is derived from experience in differential 
testing of compilers and run-time systems at 
DIGITAL over the last few years and recently
at Compaq. A working prototype for testing 
C compilers is available on the web. 



The Testing Problem 

Successful commercial computer systems contain tens 
of millions of lines of handwritten software, all of 
which is subject to change as competitive pressures 
motivate the addition of new features in each release. 
As a practical matter, quality is not a question of cor- 
rectness, but rather of how many bugs are fixed and 
how few are introduced in the ongoing development 
process. If the bug count is increasing, the software is 
deteriorating. 

Quality 

Testing is a major contributor to quality — it is the last 
chance for the development organization to reduce 
the number of bugs delivered to customers. Typically, 
developers build a suite of tests that the software must 
pass to advance to a new release. Three major sources 
of such tests are the development engineers, who 
know where to probe the weak points; commercial test 
suites, which are the arbiters of conformance; and cus- 
tomer complaints, which developers must address to 
win customer loyalty. All three types of test cases are 
relevant to customer satisfaction and therefore have 
value to the developers. The resultant test suite for the 
software under test becomes intellectual property, 
encapsulates the accumulated experience of problem 
fixes, and can contain more lines of code than the soft- 
ware itself. 

Testing is always incomplete. The simplest measure 
of completeness is statement coverage. Instrumentation 
can be added to the software before it is tested. When 
a test is run, the instrumentation generates a report 
detailing which statements are actually executed. 
Obviously, code that is not executed was not tested. 
Random testing is a way to make testing more com- 
plete. One value of random testing is introducing the 
unexpected test: 1,000 monkeys on the keyboard can
produce some surprising and even amusing input! The 
traditional approach to acquiring such input is to let 
university students use the software. 

Testing software is an active field of endeavor.
Interesting starting points for gathering background
information and references are the web site maintained
by Software Research, Inc.[1] and the book
Software Testing and Quality Assurance.[2]

Digital Technical Journal, Vol. 10 No. 1, 1998

Developer Distaste 

A development team with a substantial bug backlog 
does not find it helpful to have an automatic bug 
finder continually increasing the backlog. The team 
priority is to address customer complaints before deal- 
ing with bugs detected by a robot. Engineers argue 
that the randomly produced tests do not uncover 
errors that are likely to bother customers. "Nobody
would do that," "That error is not important," and
"Don't waste our time; we have plenty of real errors 
to fix" are typical developer retorts. 

The complaints have a substantial basis. During a visit
to our development group, Professor C. A. R. Hoare of 
Oxford University succinctly summarized one class of 
complaints: "You cannot fix an infinite number of bugs 
one at a time." Some software needs a stronger remedy 
than a stream of bug reports. Moreover, a stream of bug 
reports may consume the energy that could be applied 
in more general and productive ways. 

The developer pushback just described indicates that
a differential testing effort must be based on a per- 
ceived need for better testing from within the product 
development team. Performing the testing is pointless 
if the developers cannot or will not use the results. 

Differential testing is most easily applicable to soft- 
ware whose quality is already under control, that is, 
software for which there are few known outstanding 
errors. Running a very large number of tests and 
expending team effort only when an error is found 
becomes an attractive alternative. Team members' 
morale increases when the software passes millions of 
hard tests and test coverage of their code expands. 

The technology should be important for applica- 
tions for which there is a high premium on correct- 
ness. In particular, product differentiation can be 
achieved for software that has few failures in compari- 
son to the competition. Differential testing is designed 
to provide such comparisons. 

The technology should also be important for appli- 
cations for which there is a high premium on indepen- 
dently duplicating the behavior of some existing 
application. Identical behavior is important when old 
software is being retired in favor of a new implementa- 
tion, or when the new software is challenging a domi- 
nant competitor. 

Seeking an Oracle 

The ugliest problem in testing is evaluating the result 
of a test. A regression harness can automatically check 
that a result has not changed, but this information 
serves no purpose unless the result is known to be correct.
The very complexity of modern software that
drives us to construct tests makes it impractical to pro- 
vide a priori knowledge of the expected results. The 
problem is worse for randomly generated tests. There 
is not likely to be a higher level of reasoning that can 
be applied, which forces the tester to instead follow 
the tedious steps that the computer will carry out dur- 
ing the test run. An oracle is needed. 

One class of results is easy to evaluate: program 
crashes. A crash is never the right answer. In the triage 
that drives a maintenance effort, crashes are assigned to 
the top priority category. Although this paper does not 
contain an in-depth discussion of crashes, all crashes 
caused by differential testing are reported and consti- 
tute a substantial portion of the discovered bugs. 

Differential testing, which is covered in the following 
section, provides part of the solution to the problem of 
needing an oracle. The remainder of the solution is dis- 
cussed in the section entitled Test Reduction. 

Differential Testing 

Differential testing addresses a specific problem — the 
cost of evaluating test results. Every test yields some 
result. If a single test is fed to several comparable pro- 
grams (for example, several C compilers), and one pro- 
gram gives a different result, a bug may have been 
exposed. For usable software, very few generated tests 
will result in differences. Because it is feasible to gener- 
ate millions of tests, even a few differences can result in 
a substantial stream of detected bugs. The trade-off is 
to use many computer cycles instead of human effort to 
design and evaluate tests. Particle physicists use the 
same paradigm: they examine millions of mostly boring 
events to find a few high-interest particle interactions. 
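
The comparison at the heart of this paragraph can be sketched in a few lines. The sketch below is written in Python purely for illustration; the function name differential_test and the three toy "systems" are hypothetical stand-ins for separately built programs such as C compilers, and taking the majority result as the reference is an assumption of the sketch.

```python
from collections import Counter

def differential_test(implementations, test_input):
    """Run one test on every implementation and report disagreements.

    `implementations` maps a name to a callable (hypothetical stand-ins
    for comparable programs). Returns the sorted names whose result
    differs from the majority result.
    """
    results = {}
    for name, run in implementations.items():
        try:
            results[name] = ("ok", run(test_input))
        except Exception as exc:  # a crash is itself a result to compare
            results[name] = ("crash", type(exc).__name__)
    majority, _ = Counter(results.values()).most_common(1)[0]
    return sorted(name for name, res in results.items() if res != majority)


# Three toy "systems" that are all meant to compute integer division.
systems = {
    "alpha": lambda ab: ab[0] // ab[1],
    "beta": lambda ab: ab[0] // ab[1],
    "gamma": lambda ab: int(ab[0] / ab[1]),  # truncates instead of flooring
}
```

Most inputs produce no difference at all; only an input such as (-7, 2) separates "gamma" from the other two, which is why a large volume of generated tests is needed.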

Several issues must be addressed to make differen- 
tial testing effective. The first issue concerns the qual- 
ity of the test. Any random string fed to a C compiler 
yields some result — most likely a diagnostic. Feeding 
random strings to the compiler soon becomes unpro- 
ductive, however, because these tests provide only 
shallow coverage of the compiler logic. Developers 
must devise tests that drive deep into the tested com- 
piler. The second issue relates to false positives. The 
results of two tested programs may differ and yet 
still be correct, depending on the requirements. For 
example, a C compiler may freely choose among alter- 
natives for unspecified, undefined, or implementation- 
defined constructs as detailed in the C Standard.[3]
Similarly, even for required diagnostics, the form of 
the diagnostic is unspecified and therefore difficult to 
compare across systems. The third issue deals with the 
amount of noise in the generated test case. Given a 
successful random test, there is likely to be a much 
shorter test that exposes the same bug. The developer
who is seeking to fix the bug strongly prefers to use the
shorter test. The fourth issue concerns comparing pro- 
grams that must run on different platforms. Differential 
testing is easily adapted to distributed testing. 

Test Case Quality 

Writing good tests requires a deep knowledge of the 
system under test. Writing a good test generator 
requires embedding that same knowledge in the gen- 
erator. This section presents the testing of C compilers 
as an example. 

Testing C Compilers

For a C compiler, we constructed sample C source files 
at several levels of increasing quality. 

1. Sequence of ASCII characters

2. Sequence of words, separators, and white space

3. Syntactically correct C program

4. Type-correct C program 

5. Statically conforming C program 

6. Dynamically conforming C program 

7. Model-conforming C program 

Given a test case selected from any level, we con- 
structed additional nearby test cases by randomly 
adding or deleting some character or word from the 
given test case. An altered test case is more likely to 
cause the compilers to issue a diagnostic or to crash. 
Both the selected and the altered test cases are valuable. 
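
A minimal sketch of constructing one nearby test case, assuming word-level edits only. It is written in Python for illustration; the name nearby_test_case and the pool of words to insert are hypothetical, and splitting on white space discards the original layout of the test case.

```python
import random

def nearby_test_case(source, words, rng):
    """Return a copy of `source` with one random word added or deleted.

    `words` is a hypothetical pool of tokens to insert; `rng` is a
    random.Random instance so that a run can be repeated from its seed.
    """
    tokens = source.split()
    if tokens and rng.random() < 0.5:
        del tokens[rng.randrange(len(tokens))]            # delete one word
    else:
        position = rng.randrange(len(tokens) + 1)
        tokens.insert(position, rng.choice(words))        # add one word
    return " ".join(tokens)
```

For example, nearby_test_case("int x = 1 ;", ["(", ")", ";", "int", "+"], random.Random(1)) returns the same declaration with one word missing or one extra word.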

One of the more entertaining testing papers reports
the results of feeding random noise to the C run-time
library.[4] A typical library function crashed or hung on 30
percent of the test cases. C compilers should do better,
but this hypothesis is worth checking. Only rarely
would a tested compiler faced with level 1 input execute
any code deeper than the lexer and its diagnostics. One
test at this level caused the compiler to crash because an
input line was too long for the compiler's buffer.

At level 2, given lexically correct text, parser error 
detection and diagnostics are tested, and at the same 
time the lexer is more thoroughly covered. The C 
Standard describes the form of C tokens and C "white- 
space" (blanks and comments). It is relatively easy to 
write a lexeme generator that will eventually produce 
every correct token and white-space. What surprised us 
was the kind of bugs that the testing revealed at this
level. One compiler could not handle 0x000001 if
there were too many leading zeros in the hexadecimal
number. Another compiler crashed when faced with
the floating-point constant 1E1000. Many compilers
failed to properly process digraphs and trigraphs.

additive-expression    additive-expression + multiplicative-expression

Figure 1
Rule That Defines the Use of "+" for Addition in C

Stochastic Grammar 

A vocabulary is a set of two kinds of symbols: terminal 
and nonterminal. The terminal symbols are what one 
can write down. The nonterminal symbols are names 
for higher level language structures. For example, the 
symbol "+" is a terminal symbol, and the symbol 
"additive-expression" is a nonterminal symbol of the 
C programming language. A grammar is a set of rules 
for describing a language. A rule has a left side and a 
right side. The left side is always a nonterminal sym- 
bol. The right side is a sequence of symbols. The rule 
gives one definition for the structure named by the left 
side. For example, the rule shown in Figure 1 defines 
the use of "+" for addition in C. This rule is recursive, 
defining additive-expression in terms of itself. 

There is one special nonterminal symbol called the 
start symbol. At any time, a nonterminal symbol can be 
replaced by the right side of a rule for which it is the left 
side. Beginning with the start symbol, nonterminals 
can be replaced until there are no more nonterminal 
symbols. The result of many replacements is a sequence 
of terminal symbols. If the grammar describes C, the 
sequence of terminal symbols will form a syntactically 
correct C program. Randomly generated white-space 
can be inserted during or after generation. 

A stochastic grammar associates a probability with 
each grammar rule. 

For level 2, we wrote a stochastic grammar for lexemes
and a Tcl script to interpret the grammar,[5,6] performing
the replacements just described. Whenever a
nonterminal is to be expanded, a new random number 
is compared with the fixed rule probabilities to direct 
the choice of right side. 
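
The expansion step can be sketched as follows. This is an illustration in Python, not the Tcl interpreter described above; the grammar fragment, its probabilities, and the names GRAMMAR and generate are assumptions made for the example.

```python
import random

# Each nonterminal maps to a list of (probability, right side) pairs.
# The rules and probabilities are illustrative, not the paper's grammar.
GRAMMAR = {
    "additive-expression": [
        (0.3, ["additive-expression", "+", "multiplicative-expression"]),
        (0.7, ["multiplicative-expression"]),
    ],
    "multiplicative-expression": [
        (0.2, ["multiplicative-expression", "*", "primary"]),
        (0.8, ["primary"]),
    ],
    "primary": [(0.5, ["1"]), (0.5, ["x"])],
}

def generate(symbol, rng):
    """Expand `symbol` into a list of terminal symbols."""
    if symbol not in GRAMMAR:              # terminal: nothing to replace
        return [symbol]
    draw = rng.random()                    # a new random number per expansion
    cumulative = 0.0
    for probability, right_side in GRAMMAR[symbol]:
        cumulative += probability
        if draw < cumulative:
            break
    # If rounding leaves draw >= cumulative, the last rule is used.
    terminals = []
    for part in right_side:
        terminals.extend(generate(part, rng))
    return terminals
```

Calling " ".join(generate("additive-expression", random.Random(0))) yields a short expression over 1, x, + and *. The recursive rules carry the smaller probabilities so that expansion is very likely to stop.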

In either case, at this level and at levels 3 through 7, 
setting the many fixed choice probabilities permits 
some control of the distribution of output values. 
Not all assignments of probabilities make sense. The 
probabilities for the right sides that define a specific 
nonterminal must add up to 1.0. The probability of 
expanding recursive rules must be weighted toward a 
nonrecursive alternative to avoid a recursion loop in 
the generator. A system of linear equations can be
solved for the expected lengths of strings generated by
each nonterminal. If, for some set of probabilities, all
the expected lengths are finite and nonnegative, this
set of probabilities ensures that the generator does not
often run away.
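
As a worked illustration of the expected-length equations, consider the recursive rule of Figure 1 chosen with an assumed probability p, and an assumed alternative rule that replaces additive-expression by multiplicative-expression with probability 1 - p. The symbols E_A, E_M, E_S, p, and q are introduced here for the example only.

```latex
% E_A, E_M: expected number of terminal symbols generated from
% additive-expression and multiplicative-expression, respectively.
\begin{align*}
E_A &= p\,(E_A + 1 + E_M) + (1 - p)\,E_M \\
    &= p\,E_A + p + E_M \\
\Rightarrow\quad E_A &= \frac{p + E_M}{1 - p}, \qquad 0 \le p < 1 .
\end{align*}
% Example: p = 0.3 and E_M = 1.5 give E_A = 1.8 / 0.7 \approx 2.57.
%
% A rule set that doubles the nonterminal, S -> S S with probability q
% and S -> a with probability 1 - q, gives
\begin{align*}
E_S = 2q\,E_S + (1 - q)
\quad\Rightarrow\quad
E_S = \frac{1 - q}{1 - 2q} ,
\end{align*}
% which is negative for q > 1/2.
```

A negative solution, as in the second case, is the signal that the chosen probabilities let the generator run away.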

Increasing Test Quality 

At level 3, given syntactically correct text, one would 
expect to see declaration diagnostics while more thor- 
oughly covering the code in the parser. At this level, 
the generator is unlikely to produce a test program 
that will compile. Nevertheless, compiler errors were 
detected. For example, one parser refused the expres- 
sion 1==1==1. 

The syntax of C is given in the C Standard. Using 
the concept of stochastic grammar, it is easy to write a 
generator that will eventually produce every syntactically
correct C translation-unit. In fact, we extended
our Tcl lexer grammar to all of C.

At level 4, given a syntactically correct generated 
program in which every identifier is declared and all 
expressions are type correct, the lexer, the parser, and a 
good deal of the semantic logic of the compiler are 
covered. Some generated test programs compile and 
execute, giving the first interesting differential testing 
results. Achieving level 4 is not easy but is relatively 
straightforward for an experienced compiler writer. A 
symbol table must be built and the identifier use lim- 
ited to those identifiers that are already declared. The 
requirements for combining arithmetic types in C 
(int, short, char, float, double with long
and/or unsigned) were expressed grammatically. 
Grammar rules defining, for example, int-additive- 
expression replaced the rules defining additive-expres- 
sion. The replacements were done systematically for all 
combinations of arithmetic types and operators. To 
avoid introducing typographical errors in the defining 
grammar, much of the grammar itself was generated 
by auxiliary Tcl programs. The Tcl grammar interpreter
did not need to be changed to accommodate
this more accurate and voluminous grammatical data.
We extended the generator to implement declare-before-use
and to provide the derived types of C
(struct, union, pointer). These necessary
improvements led to thousands of lines of tricky
implementation detail in Tcl. At this point, Tcl, a
nearly structureless language, was reaching its limits
as an implementation language.

At level 5, where the static semantics of the C 
Standard have been factored into the generator, most 
generated programs compile and run. 

Figure 2 contains a fragment of a generated C test 
program from level 5. 

A large percentage of level 5 programs terminate 
abnormally, typically on a divide-by-zero operation. A
peculiarity of C is that many operators produce a
Boolean value of 0 or 1. Consequently, a lot of expression
results are 0, so it is likely for a division operation
to have a zero denominator. Such tests are wasted. The 
number of wasted tests can be reduced somewhat by 
setting low probabilities for using divide, for creating 
Boolean values, or for using Boolean values as divisors. 

Regarding level 6, dynamic standards violations can- 
not be avoided at generation time without a priori 
choosing not to generate some valid C, so instead we 
implement post-run analysis. For every discovered dif- 
ference (potential bug), we regenerate the same test case, 
replacing each arithmetic operator with a function call, 
inside which there is a check for standards violations. 

The following is a function that checks for "integer 
shift out of range." (If we were testing C++, we could 
have used overloading to avoid having to include the 
type signature in the name of the checking function.) 

#include <assert.h>

int
int_shl_int_int(int val, int amt) {
    /* reject shift counts that are negative or too large for an int */
    assert(amt >= 0 && amt < sizeof(int) * 8);
    return val << amt;
}

For example, the generated text 

a << b 

is replaced upon regeneration by the text 

int_shl_int_int (a , b) 



+ • ull5 + -- ui8 * . . ull6 - ( uil7 + ++ ui20 * ( sl21 &. ( argc « = 
cl4 ) ? ( us23 ) < *+ argc <= ++ sl22 ;--((*&*£ sl24 ) ) 
0160030347U < ++ ( C5u7 ) . sit5m6 & 173104443BU * ++ ui25 * ( 
unsigned int ) ++ ( ld26 ))&((( 0761 ) * 2137167721L * sl27 ? 
u!28 & dl2 * x + d9 * DBL_EPSILON * 7e*4 * ** dll ♦ ■>■ + dlO * dl2 * ( 

ld31 * .4L * 9.1 - ld32 * +♦ f 33 - - .7392E-SL * ld34 ♦ 22.82L 
+ 1.91 * ld3S >= ++ ld37 ) =^ 9 . F + ( ++ f 38 ) + ++ f 39 *f40 > ( 
float ) ♦+ f41 * £42 >= cl4 ++ : sc43 k ss44 ) » uc!3 & .9309L - ( 
ui!8 * 007101U * uilS ? sc4S -- ? -- ld47 + ld4l : ♦+ ld49 - ld48 * 
++ ld50 : -+ ld51 ) >= 239.611 ) A - ++ argc ( int signed ) argc - 
++ ui54 )- +♦ ul57 >= ++ ul58 * argc - 9ul * ++ * & ul59 * ++ ul60 ; 



Figure 2 

Generated C Expression 
If, on being rerun, the regenerated test case asserts a
standards violation (for example, a shift of more than 
the word length), the test is discarded and testing con- 
tinues with the next case. 

Two problems with the generator remain: (1) obtaining
enough output from the generated programs so
that differences are visible and (2) ensuring that the
generated programs resemble real-world programs so
that the developers are interested in the test results.
Solving these two problems brings the quality of test
input to level 7. The trick here is to begin generating the
program not from the C grammar nonterminal symbol
translation-unit but rather from a model program
described by a more elaborate string in which some of
the program is already fully generated. As a simple
example, suppose you want to generate a number of
print statements at the end of the test program. The
starting string of the generating grammar might be

#define P(v) printf(#v " = %x\n", v)

int main() {
    declaration-list
    statement-list
    print-list
    exit(0);
}

where the grammatical definition of print-list is 
given by 

print-list    P ( identifier ) ;
print-list    print-list P ( identifier ) ;

In the starting string above there are three nonter- 
minals for the three lists instead of just one for the 
standard C start symbol translation-unit. Programs 
generated from this starting string will cause output 
just before exit. Because differences caused by round- 
ing error were uninteresting to us, we modified this 
print macro for types float and double to print only 
a few significant digits. With a little more effort, the 
expansion of print-list can be forced to print each 
variable exactly once. 
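
One way to force each variable to be printed exactly once is to derive the print list from the declared names instead of expanding it randomly. The sketch below, in Python, is an illustration only. The function model_program and its arguments are hypothetical, the declarations and statements are supplied by hand where a real generator would take them from the stochastic grammar, and the P macro follows the paper's starting string.

```python
def model_program(declarations, statements):
    """Build a level 7 style test program from a fixed skeleton.

    `declarations` maps a variable name to its C type and `statements`
    is a list of C statements; both are hand-written placeholders here.
    """
    lines = ['#include <stdio.h>', '#include <stdlib.h>',
             '#define P(v) printf(#v " = %x\\n", v)', '', 'int main() {']
    lines += [f"    {ctype} {name} = 0;" for name, ctype in declarations.items()]
    lines += [f"    {statement}" for statement in statements]
    # The print list: one P(...) per declared variable, in declaration order.
    lines += [f"    P({name});" for name in declarations]
    lines += ["    exit(0);", "}"]
    return "\n".join(lines)
```

For example, model_program({"a": "int", "b": "int"}, ["a = 3;", "b = a << 2;"]) returns a complete C source text whose last actions are P(a); P(b); exit(0);.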

Alternatively, suppose a test designer receives a bug 
report from the field, analyzes the report, and fixes the 
bug. Instead of simply putting the bug-causing case in 
the regression suite, the test designer can generalize it 
in the manner just presented so that many similar test 
cases can be used to explore for other nearby bugs. 

The effect of level 7 is to augment the probabilities 
in the stochastic grammar with more precise and direct 
means of control. 

Forgotten Inputs 

The elaborate command-line flags, config files, and 
environment variables that condition the behavior of 
programs are also input. Such input can also be gener- 
ated using the same toolset that is used to generate the 
test programs. The very first test on the very first run 



with generated compiler directive flags revealed a bug 
in a compiler under test — it could not even compile its 
own header files. 

Results 

Table 1 indicates the kinds of bugs we discovered dur- 
ing the testing. Only those results that are exhibited by 
very short text are shown. Some of the results derive 
from hand generalization of a problem that originally 
surfaced through random testing. 

There was a reason for each result. For example, the 
server crash occurred when the tested compiler got a 
stack overflow on a heavily loaded machine with a very 
large memory. The operating system attempted to 
dump a gigabyte of compiler stack, which caused all
the other active users to thrash, and many of them also 
dumped for lack of memory. The many disk drives on 
the server began a dance of the lights that sopped up 
the remaining free resources, causing the operators to 
boot the server to recover. Excellent testing can make 
you unpopular with almost everyone. 

Test Distribution 

Each tested or comparison program must be executed 
where it is supported. This may mean different hard- 
ware, operating system, and even physical location. 

There are numerous ways to utilize a network 
to distribute tests and then gather the results. One par- 
ticularly simple way is to use continuously running 
watcher programs. Each watcher program periodically 
examines a common file system for the existence of 
some particular files upon which the program can act. 
If no files exist, the watcher program sleeps for a while 
and tries again. On most operating systems, watcher 
programs can be implemented as command scripts. 

There is a test master and a number of test beds. 
The test master generates the test cases, assigns them 
to the test beds, and later analyzes the results. Each 
test bed runs its assigned tests. The test master and test 
beds share a file space, perhaps via a network. For each 
test bed there is a test input directory and a test output 
directory. 

A watcher program called the test driver waits until 
all the (possibly remote) test input directories are 
empty. The test driver then writes its latest generated 
test case into each of the test input directories and 
returns to its watch-sleep cycle. For each test bed there
is a test watcher program that waits until there is a file
in its test input directory. When a test watcher finds a
file to test, the test watcher runs the new test, puts the
results in its test output directory, and returns to the
watch-sleep cycle. Another watcher program called
the test analyzer waits until all the test output directories
contain results. Then the results, both input and
output, are collected for analysis, and all the files are
deleted from every test input and output directory,
thus enabling another cycle to begin.

Table 1
Results of Testing C Compilers

Source Code                 Resulting Problem

if (1.1)                    Constant float expression evaluated false
1 ? 1 : 1/0                 Several compiler crashes
0.0F/0.0F                   Compiler crash
x != 0 ? x/x : 1            Incorrect answer
1 == 1 == 1                 Spurious syntax error
-!0                         Spurious type error
0x000000000000000           Spurious constant out of range message
0x80000000                  Incorrect constant conversion
1E1000                      Compiler crash
1 >> INT_MAX                Twenty-minute compile time
'ab'                        Inconsistent byte order
int i=sizeof(i = 1);        Compiler crash
LDBL_MAX                    Incorrect value
(++n,0) ? --n : 1           Operator ++ ignored
if (sizeof(char)+d) f(d)    Illegal instruction in code generator
i=(unsigned)-1.0F;          Random value
int f(register());          Compiler crash or spurious diagnostic
int (...(x)...);            Enough nested parentheses to kill the compiler
                              Spurious diagnostic (10 parentheses)
                              Compiler crash (100 parentheses)
                              Server crash (10,000 parentheses)
digraphs (<: <% etc.)       Spurious error messages
a/b                         The famous Pentium divide bug (we did not catch it
                              but we could have)

Using the file system for synchronization is adequate 
for computations on the scale of a compile-and-execute 
sequence. Because of the many sleep periods, this distribution
system runs efficiently but not fast. If throughput
becomes a problem, the test system designer can
provide more sophisticated remote execution. The distribution
solution as described is neither robust against
crashes and loops nor easy to start. It is possible to elab- 
orate the watcher programs to respond to a reasonable 
number of additional requirements. 
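
A test watcher of the kind described can be sketched as a polling loop over a shared directory. The sketch is in Python instead of a command script, and the names watch_test_bed and run_test are hypothetical. It assumes that a test with a result already in the output directory has been run and is waiting for the test analyzer, and it bounds the number of polls so that it can be tried out.

```python
import os
import tempfile
import time

def watch_test_bed(input_dir, output_dir, run_test, polls, sleep_seconds=1.0):
    """Poll input_dir `polls` times; run any test that has no result yet."""
    for _ in range(polls):
        ran_something = False
        for name in sorted(os.listdir(input_dir)):
            result_path = os.path.join(output_dir, name)
            if os.path.exists(result_path):
                continue                  # already run, not yet collected
            with open(os.path.join(input_dir, name)) as f:
                result = run_test(f.read())
            with open(result_path, "w") as f:
                f.write(result)
            ran_something = True
        if not ran_something:
            time.sleep(sleep_seconds)     # nothing to do: sleep, then look again


# One pass over temporary directories, with str.upper standing in for
# "compile and execute".
with tempfile.TemporaryDirectory() as test_in, \
        tempfile.TemporaryDirectory() as test_out:
    with open(os.path.join(test_in, "case1.c"), "w") as f:
        f.write("int main() { return 0; }")
    watch_test_bed(test_in, test_out, str.upper, polls=1)
    with open(os.path.join(test_out, "case1.c")) as f:
        collected = f.read()
```

The sketch shares the weaknesses noted above: a test that loops forever blocks the watcher, and nothing restarts a watcher that dies.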

Test Analysis 

The test analyzer can compare the output in various 
ways. The goal is to discover likely bugs in the com- 
piler under test. The initial step is to distinguish the 
test results by failure category, using corresponding 
directories to hold the results. If the compiler under 
test crashes, the test analyzer writes the test data to the 
crash directory. If the compiler under test enters an
endless loop, the test analyzer writes the test data to
the loop directory. If one of the comparison compilers 
crashes or enters an endless loop, the test analyzer dis- 
cards the test, since reporting the bugs of a compari- 
son compiler is not a testing objective. If some, but 
not all, of the test case executions terminate abnor- 
mally, the test case is written to the abend directory. If 
all the test cases run to completion but the output dif- 
fers, the case is written to the test diff directory. 
Otherwise, the test case is discarded. 
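
The filing rules above amount to a small decision procedure. The following Python sketch is one reading of them; the function classify and the status strings are assumptions of the sketch, not names from the test system.

```python
def classify(outcomes, under_test):
    """Return the directory name for one test case, or None to discard it.

    `outcomes` maps a compiler name to a (status, output) pair. The
    status strings are assumptions of this sketch:
      "compiler-crash", "compiler-loop"  the compiler itself failed
      "abend"                            the compiled test ended abnormally
      "ok"                               the compiled test ran to completion
    """
    status, _ = outcomes[under_test]
    if status == "compiler-crash":
        return "crash"
    if status == "compiler-loop":
        return "loop"
    others = [s for name, (s, _) in outcomes.items() if name != under_test]
    if any(s in ("compiler-crash", "compiler-loop") for s in others):
        return None                  # a comparison compiler failed: discard
    statuses = [s for s, _ in outcomes.values()]
    if "abend" in statuses and "ok" in statuses:
        return "abend"               # some, but not all, runs ended abnormally
    outputs = {out for s, out in outcomes.values() if s == "ok"}
    if len(outputs) > 1:
        return "diff"                # all ran to completion, output differs
    return None
```

A test on which every compiled program ends abnormally is discarded, which matches the treatment of wasted tests such as division by zero.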

Test Reduction 

A tester must examine each filed test case to determine 
if it exposes a fault in the compiler under test. The first 
step is to reduce the test to the shortest version that 
qualifies for examination. 

A watcher called the crash analyzer examines the 
crash directory for files and moves found files to a 
working directory. The crash analyzer then applies a 
shortening transformation to the source of the test 
case and reruns the test. If the compiler under test still 
crashes, the original test case is replaced by the short- 
ened test case. Otherwise, the change is backed out 
and a new transformation is tried. We used 23 heuristic
transformations, including

■ Remove a statement 

■ Remove a declaration 

■ Change a constant to 1 

■ Change an identifier to 1 

■ Delete a pair of matching braces 

■ Delete an if clause 

When all the transformations have been systematically 
tried once, the process is started over again. The 
process is repeated until a whole cycle leaves the 
source of the test unchanged. A similar process is used 
for the loop, abend, and diff directories. 
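
The reduction loop can be sketched as follows, in Python for illustration. The names reduce_test, still_fails, remove_a_line, and constant_to_one are hypothetical, only two of the transformations are shown, and the sketch assumes that every transformation simplifies its input so that the loop terminates.

```python
import re

def reduce_test(source, still_fails, transformations):
    """Shrink `source` while `still_fails(source)` stays true.

    Each transformation returns a list of candidate variants of its
    input; a candidate is kept only if the failure is still observed.
    """
    changed = True
    while changed:                    # stop when a whole cycle changes nothing
        changed = False
        for transform in transformations:
            for candidate in transform(source):
                if candidate != source and still_fails(candidate):
                    source = candidate        # keep the shortened test case
                    changed = True
                    break                     # rejected candidates are dropped
    return source

def remove_a_line(source):
    lines = source.split("\n")
    return ["\n".join(lines[:i] + lines[i + 1:]) for i in range(len(lines))]

def constant_to_one(source):
    return [source[:m.start()] + "1" + source[m.end():]
            for m in re.finditer(r"\b\d+\b", source) if m.group() != "1"]
```

With still_fails standing in for "the compiler under test still crashes", a four-line input whose failure depends only on the text "/ 0" reduces to the single line that contains it.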

The typical result of the test reduction process is to 
reduce generated C test programs of 500 to 600 lines 
to equally useful C programs of only a few lines. It is 
not unusual to use 10,000 or more compile opera- 
tions during test reduction. The trade-off is using 
many computer cycles instead of human effort to ana- 
lyze the ugly generated test case. 

Test Presentation 

After the shortest form of the test case is ready, the test 
analyzer wraps it in a command script that 

1. Reports environmental information (compiler ver- 
sion, compiler flags, name of the test platform, time 
of test, etc.) 

2. Reports the test output or crash information 

3. Reruns the test (the test input is embedded in the 
script) 

The test analyzer writes the command scripts to a 
results directory. 

Test Evaluation and Report 

The person who is managing the differential testing 
setup periodically runs scripts that have accumulated in 
the results directory to determine which ones expose a 
problem of interest to the development team. One 
problem peculiar to random testing is that once a bug 
is found, it will be found again and again until it is 
fixed. This argues the case for giving high priority to 
the bugs exposed by differential testing. Uninteresting 
and duplicate tests are manually discarded, and the rest
are entered into the development team bug queue. 

Summary and Directions 

Differential testing, suitably tuned to the tested 
program, complements traditional software testing 
processes. It finds faults that would otherwise remain 
undetected. It is cost-effective. It is applicable to a 
wide range of large software. It has proven unpopular
with the developers of the tested software. 



This technology exposed new bugs in C compilers 
each day during its use at DIGITAL. Most of the bugs 
were in the comparison compilers, but a significant 
number of bugs in DIGITAL code were found and 
corrected.

Numerous special-purpose differential testing har- 
nesses were put into use at DIGITAL, each testing 
some small part of a large program. For example, the 
C preprocessor, multidimensional Fortran arrays, 
optimizer constant folding, and a new printf func- 
tion each were tested by ad hoc differential testers. 

The Java API (run-time library) is a large body of 
relatively new code that runs on a wide variety of plat- 
forms. Since "Write once, run anywhere" is the Java 
motto, the standard for conformance is high; however, 
experience has shown that the standard is difficult to 
achieve. Differential testing should help. What needs 
to be done is to generate a sequence of calls into the 
API on various Java platforms, comparing the results 
and reporting differences. Technically, this procedure 
is much simpler than testing C compilers. Chris Rohrs, 
an MIT intern at DIGITAL, wrote a system entirely in 
Java, gathering method signature information directly 
out of the binary class files. This API tester may be 
used when the quality of the Java API reaches the 
point where the implementors are not buried in bug 
reports and when there are more independent imple- 
mentations of the Java run time. 
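
The call-and-compare procedure can be illustrated with a small sketch. It is written in Python rather than Java and is not the system described above: the platforms are modeled as hypothetical dictionaries of callables, and the function name compare_platforms is an assumption for the example.

```python
def compare_platforms(calls, platforms):
    """Run each call on every platform and return the calls whose results differ.

    calls: list of (function_name, argument_tuple)
    platforms: dict mapping a platform name to a dict of callables
    (both structures are hypothetical stand-ins for real API implementations)
    """
    differences = []
    for name, args in calls:
        results = {}
        for platform, api in platforms.items():
            try:
                # repr() makes any returned value comparable and hashable.
                results[platform] = ("ok", repr(api[name](*args)))
            except Exception as error:
                # An exception is also an observable result worth comparing.
                results[platform] = ("raised", type(error).__name__)
        if len(set(results.values())) > 1:
            differences.append((name, args, results))
    return differences


# Two toy "implementations" that disagree on rounding of negative quotients.
platforms = {
    "a": {"div": lambda x, y: x // y},
    "b": {"div": lambda x, y: int(x / y)},
}
calls = [("div", (7, 2)), ("div", (-7, 2)), ("div", (1, 0))]
report = compare_platforms(calls, platforms)
```

In this toy run only the call with a negative operand is reported, because both implementations agree on the positive quotient and both raise the same exception on division by zero.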

Differential testing can be used to increase test cov- 
erage. Using the coverage data taken from running 
the standard regression suite as a baseline, the devel- 
opers can run random tests to see if coverage can 
be increased. Developers can freely add coverage- 
increasing tests to the test suite using the test output as 
an initial oracle. No harm is done because even if the 
recorded result is wrong, the compiler is no worse off 
for it. If at a later time a regression is observed on the 
generated test, either the new or the old version was 
wrong. The developers are alerted and can react. John 
Parks and John Hale applied this technology to 
DIGITAL's C compilers. 
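
A minimal sketch of this loop, assuming a hypothetical run function that returns both the output of a test and the set of coverage points it reached, might look like the following. The names grow_suite and regressions are illustrative only and do not describe the tooling used at DIGITAL.

```python
def grow_suite(baseline, candidates, run):
    """Keep only the random tests that reach coverage points not yet covered.

    baseline: set of coverage points reached by the standard regression suite
    candidates: iterable of generated tests
    run: hypothetical callable, run(test) -> (output, set_of_coverage_points)
    """
    covered = set(baseline)
    suite = []
    for test in candidates:
        output, hit = run(test)
        if not hit <= covered:             # the test reaches something new
            covered |= hit
            suite.append((test, output))   # recorded output is the initial oracle
    return suite, covered


def regressions(suite, run):
    """Return the tests whose current output differs from the recorded output."""
    return [test for test, expected in suite if run(test)[0] != expected]
```

A test reported by regressions does not say which version is wrong, only that the old and new versions disagree, which is the signal the developers need in order to react.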

The problem of retiring an old compiler in favor of a 
new one requires the new one to duplicate old behavior 
so as not to upset the installed base. Differential testing 
can compare the old and the new, flagging all new 
results (correct or not) that disagree with the old results. 

Differential testing can be used to measure quality. 
Supposing that the majority rules, a million tests can 
be run on a set of competing compilers. The metric is 
failed tests per million runs. The authors of the failed 
compilers can either fix the bugs or prove the majority 
wrong. In any case, quality improves. 
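
The metric can be computed along the following lines. This sketch makes two assumptions that the text does not settle: a test counts against a compiler only when a strict majority of the compilers agree on a different result, and tests without a strict majority still count toward the total number of runs. The function name and data layout are hypothetical.

```python
from collections import Counter


def failures_per_million(results):
    """Score each compiler by how often it disagrees with the majority.

    results: list with one dict per test, mapping compiler name -> observed output
    Returns a dict mapping compiler name -> failed tests per million runs.
    """
    failures = Counter()
    names = set()
    for outcome in results:
        if not outcome:
            continue
        names.update(outcome)
        winner, votes = Counter(outcome.values()).most_common(1)[0]
        if 2 * votes <= len(outcome):
            continue  # assumption: without a strict majority, nobody is blamed
        for name, output in outcome.items():
            if output != winner:
                failures[name] += 1
    total = len(results)
    if total == 0:
        return {}
    return {name: failures[name] * 1_000_000 / total for name in sorted(names)}
```

With three compilers and four tests, one of which shows a single dissenting compiler, the dissenter scores 250,000 failures per million and the others score zero.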

At Compaq, differential testing opportunities arise 
regularly and are often satisfied by testing systems that 
are less elaborate than the original C testing system, 
which has been retired. 